CN113420166A - Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment - Google Patents


Info

Publication number
CN113420166A
Authority
CN
China
Prior art keywords
data
commodity
multimedia
feature
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110328304.0A
Other languages
Chinese (zh)
Inventor
雷陈奕
王国鑫
唐海红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Innovation Co
Original Assignee
Alibaba Singapore Holdings Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Singapore Holdings Pte Ltd filed Critical Alibaba Singapore Holdings Pte Ltd
Priority to CN202110328304.0A priority Critical patent/CN113420166A/en
Publication of CN113420166A publication Critical patent/CN113420166A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/483Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/06Buying, selling or leasing transactions
    • G06Q30/0601Electronic shopping [e-shopping]
    • G06Q30/0631Item recommendations

Abstract

The embodiment of the invention provides a commodity mounting, retrieving, recommending and training processing method, a corresponding device and electronic equipment, wherein the training method comprises the following steps: obtaining content-related samples in a plurality of content domains to form a plurality of cross-domain sample combinations; performing feature extraction on the original data of the samples in the cross-domain sample combinations by using the feature extraction model being trained, to generate feature vectors corresponding to the samples; and performing contrastive learning training on the feature extraction model, with the training target of reducing the distance between feature vectors corresponding to samples within the same cross-domain sample combination and amplifying the distance between feature vectors corresponding to samples from different cross-domain sample combinations. The embodiment of the invention trains the feature extraction model through contrastive learning to realize feature extraction for data of different content domains; the resulting feature vectors have the property of cross-domain alignment, further processing such as feature comparison can be carried out, and the matching effect of cross-domain content retrieval and recommendation is effectively improved.

Description

Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
Technical Field
The application relates to a commodity mounting, retrieving, recommending and training processing method and device and electronic equipment, and belongs to the technical field of computers.
Background
With the development of internet technology, multimedia content plays an increasingly large role in the e-commerce field; for example, related commodities are mounted in live broadcasts, short videos and other media to achieve a better promotional effect. From the user's perspective, rich multimedia content conveys the background knowledge and application scenarios of a commodity, providing ample information on which to base a purchase decision.
However, since multimedia data and commodity data belong to different content domains, they have different organization forms. Multimedia data is displayed freely and without restriction, and can contain various creative elements, such as complex visual backgrounds, natural-language titles and diversified presentation themes. The organization form of commodity data generally follows e-commerce constraints and is more structured; for example, the commodity background is required to be concise and standardized, and the commodity title is stacked with keywords to facilitate retrieval hits. Therefore, there is a huge semantic gap between multimedia data and commodity data. Because of this semantic gap, multimedia data and commodity data are difficult to match and fuse in the prior art, which hinders providing users with better cross-domain content recommendation or commodity mounting.
Disclosure of Invention
The embodiment of the invention provides a commodity mounting, retrieving, recommending and training processing method and device and electronic equipment, and aims to achieve cross-content-domain feature comparison processing.
In order to achieve the above object, an embodiment of the present invention provides a training method for a feature extraction model, including:
obtaining samples related to the content in a plurality of content domains to form a plurality of cross-domain sample combinations;
performing feature extraction on original data of the samples in the cross-domain sample combinations by using the feature extraction model being trained, to generate feature vectors corresponding to the samples;
and performing contrastive learning training on the feature extraction model, with the training target of reducing the distance between the feature vectors corresponding to samples within the same cross-domain sample combination and amplifying the distance between feature vectors corresponding to samples from different cross-domain sample combinations.
The embodiment of the invention provides a commodity mounting processing method, which comprises the following steps:
acquiring multimedia data, performing cross-domain feature extraction on the multimedia data, and generating a multimedia feature vector corresponding to the multimedia data, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
acquiring a plurality of candidate commodity data, and performing cross-domain feature extraction on the candidate commodity data to generate a plurality of commodity feature vectors;
and determining commodity data for mounting from the candidate commodity data according to the correlation degree between the plurality of commodity feature vectors and the multimedia feature vector.
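The mounting-selection step above can be sketched as follows. This is a minimal illustration only, not the patent's implementation: it assumes cosine similarity as the "correlation degree", uses NumPy arrays standing in for the extracted cross-domain feature vectors, and `select_commodities_for_mounting` and all sample data are hypothetical names.

```python
import numpy as np

def select_commodities_for_mounting(media_vec, commodity_vecs, top_k=1):
    """Rank candidate commodities by cosine similarity (standing in for
    the 'correlation degree') against the multimedia feature vector and
    return the indices of the top-k candidates."""
    m = media_vec / np.linalg.norm(media_vec)
    c = commodity_vecs / np.linalg.norm(commodity_vecs, axis=1, keepdims=True)
    scores = c @ m  # one correlation score per candidate commodity
    return np.argsort(scores)[::-1][:top_k]

# Toy aligned feature vectors (hypothetical data, not from the patent)
media = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.0],  # strongly related commodity
    [0.0, 1.0, 0.0],  # unrelated commodity
    [0.7, 0.7, 0.0],  # partially related commodity
])
best = select_commodities_for_mounting(media, candidates, top_k=1)
```

Because the vectors are cross-domain aligned, the same similarity computation works whether the candidates come from the commodity domain or the multimedia domain.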
The embodiment of the invention provides a commodity mounting processing method, which comprises the following steps:
acquiring multimedia data, performing cross-domain feature extraction on the multimedia data, and generating a multimedia feature vector corresponding to the multimedia data, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
retrieving commodity data according to the multimedia feature vector, and acquiring commodity data corresponding to commodity feature vectors whose correlation degree with the multimedia feature vector is greater than a preset threshold;
and mounting the commodity data on the multimedia data.
The embodiment of the invention provides a recommendation processing method, which comprises the following steps:
acquiring historical multimedia data and/or historical commodity data accessed by a user in a historical manner;
performing cross-domain feature extraction on the historical multimedia data and/or the historical commodity data to generate corresponding historical multimedia feature vectors and/or historical commodity feature vectors, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain mode;
and retrieving multimedia data and/or commodity data according to the historical multimedia feature vector and/or the historical commodity feature vector, and acquiring the multimedia data and/or commodity data whose feature vectors have a correlation degree with the historical feature vectors greater than a preset threshold, as recommendation data.
The embodiment of the invention provides a retrieval processing method, which comprises the following steps:
generating a query vector according to retrieval information input by a user;
according to the query vector, querying a commodity feature vector database and/or a multimedia feature vector database, and acquiring a commodity feature vector and/or a multimedia feature vector of which the first correlation degree with the query vector is greater than a preset first threshold, wherein the commodity feature vector and/or the multimedia feature vector in the commodity feature vector database and/or the multimedia feature vector database are acquired based on cross-domain feature extraction, and the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
and returning corresponding commodity data and/or multimedia data to the user as a retrieval result according to the obtained commodity feature vector and/or multimedia feature vector.
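A threshold-based retrieval step like the one above might look as follows. This is a hedged sketch: cosine similarity stands in for the "correlation degree", an in-memory array stands in for the feature vector database, and `retrieve` and the sample data are illustrative names, not from the patent.

```python
import numpy as np

def retrieve(query_vec, db_vecs, db_items, threshold=0.8):
    """Return every item whose feature vector's correlation (cosine
    similarity here) with the query vector exceeds the preset threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    d = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = d @ q
    return [item for item, s in zip(db_items, sims) if s > threshold]

# Hypothetical feature-vector "database" of commodities and multimedia
db = np.array([[1.0, 0.0], [0.0, 1.0], [0.95, 0.05]])
items = ["commodity_a", "media_b", "commodity_c"]
hits = retrieve(np.array([1.0, 0.0]), db, items, threshold=0.8)
```

In practice a vector index (approximate nearest-neighbor search) would replace the brute-force scan, but the thresholding logic is the same.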
The embodiment of the invention provides a training device of a feature extraction model, which comprises:
the system comprises a sample acquisition module, a cross-domain sample combination module and a content analysis module, wherein the sample acquisition module is used for acquiring samples related to content in a plurality of content domains to form a plurality of cross-domain sample combinations;
the feature vector generation module is used for performing feature extraction on the original data of the samples in the cross-domain sample combinations by using the feature extraction model being trained, to generate the feature vectors corresponding to the samples;
and the training module is used for performing contrastive learning training on the feature extraction model, with the training target of reducing the distance between feature vectors corresponding to samples within the same cross-domain sample combination and amplifying the distance between feature vectors corresponding to samples from different cross-domain sample combinations.
An embodiment of the present invention provides a processing apparatus for commodity mounting, including:
the multimedia feature extraction module is used for acquiring multimedia data, performing cross-domain feature extraction on the multimedia data and generating a multimedia feature vector corresponding to the multimedia data, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
the commodity feature extraction module is used for acquiring a plurality of candidate commodity data, performing cross-domain feature extraction on the candidate commodity data and generating a plurality of commodity feature vectors;
and the relevancy processing module is used for determining commodity data for mounting from the candidate commodity data according to the relevancy between the commodity feature vectors and the multimedia feature vector.
An embodiment of the present invention provides another processing apparatus for commodity mounting, including:
the multimedia feature extraction module is used for acquiring multimedia data, performing cross-domain feature extraction on the multimedia data and generating a multimedia feature vector corresponding to the multimedia data, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
the commodity retrieval processing module is used for retrieving commodity data according to the multimedia feature vector, and acquiring commodity data corresponding to commodity feature vectors whose correlation degree with the multimedia feature vector is greater than a preset threshold;
and the mounting processing module is used for mounting the commodity data on the multimedia data.
An embodiment of the present invention provides a recommendation processing apparatus, including:
the historical data acquisition module is used for acquiring historical multimedia data and/or historical commodity data accessed by a user in a historical manner;
the historical data feature extraction module is used for performing cross-domain feature extraction on the historical multimedia data and/or the historical commodity data to generate corresponding historical multimedia feature vectors and/or historical commodity feature vectors, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain mode;
and the data retrieval processing module is used for retrieving multimedia data and/or commodity data according to the historical multimedia feature vector and/or the historical commodity feature vector, and acquiring the multimedia data and/or commodity data whose feature vectors have a correlation degree with the historical feature vectors greater than a preset threshold, as recommendation data.
An embodiment of the present invention provides a search processing apparatus, including:
the query vector generation module is used for generating a query vector according to the retrieval information input by the user;
the query vector retrieval module is used for querying in a commodity feature vector database and/or a multimedia feature vector database according to the query vector to obtain a commodity feature vector and/or a multimedia feature vector of which the first correlation degree with the query vector is greater than a preset first threshold, wherein the commodity feature vector and/or the multimedia feature vector in the commodity feature vector database and/or the multimedia feature vector database are obtained based on cross-domain feature extraction, and the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
and the retrieval result feedback module is used for returning corresponding commodity data and/or multimedia data to the user as a retrieval result according to the obtained commodity characteristic vector and/or multimedia characteristic vector.
The embodiment of the invention provides a commodity mounting processing method, which comprises the following steps:
acquiring multimedia data uploaded by a user;
selecting matched commodity data from a commodity database according to content features obtained by performing cross-domain feature extraction on the multimedia data, and recommending the commodity data to the user;
and responding to the selection of the commodity data of the user, and mounting the selected commodity data into the multimedia data.
The embodiment of the invention provides a commodity mounting processing method, which comprises the following steps:
acquiring multimedia data uploaded by a user and a plurality of commodity data to be mounted;
acquiring the correlation between the commodity data and the multimedia data according to the content characteristics obtained by performing cross-domain characteristic extraction on the multimedia data and the commodity characteristics obtained by performing cross-domain characteristic extraction on the commodity data;
and recommending commodity data from a plurality of commodity data to be mounted according to the correlation.
An embodiment of the present invention provides an electronic device, including:
a memory for storing a program;
a processor for executing the program stored in the memory to execute the aforementioned training method of the feature extraction model, and/or the aforementioned processing method of the commodity mounting, and/or the aforementioned recommendation processing method, and/or the aforementioned retrieval processing method.
According to the methods, devices and electronic equipment provided by the embodiments of the invention, a cross-content-domain feature extraction model is trained by contrastive learning, realizing feature extraction for data of different content domains; the extracted feature vectors have the cross-domain alignment property, further processing such as feature comparison can be carried out, and the matching effect of cross-domain content retrieval and recommendation is effectively improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
FIG. 1 is a processing system for training a feature extraction model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an application scenario of a feature extraction model according to an embodiment of the present invention;
FIG. 3 is a first flowchart illustrating a processing method for commodity mounting according to an embodiment of the present invention;
FIG. 4 is a schematic view of an application scenario of the processing method for commodity mounting according to an embodiment of the present invention;
FIG. 5 is a second flowchart illustrating a processing method for commodity mounting according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a recommendation processing method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an application scenario of a recommendation processing method according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating a retrieval processing method according to an embodiment of the present invention;
FIG. 9 is a schematic view of an application scenario of the retrieval processing method according to an embodiment of the present invention;
FIG. 10 is a schematic flowchart of a training method of a feature extraction model according to an embodiment of the present invention;
FIG. 11 is a first schematic structural diagram of a commodity mounting processing device according to an embodiment of the present invention;
FIG. 12 is a second schematic structural diagram of a commodity mounting processing device according to an embodiment of the present invention;
FIG. 13 is a schematic structural diagram of a recommendation processing device according to an embodiment of the present invention;
FIG. 14 is a schematic structural diagram of a retrieval processing device according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of a training device for a feature extraction model according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The technical solution of the present invention is further illustrated by some specific examples.
In order to realize matching and fusion of cross-domain data, an embodiment of the invention provides a training method for a feature extraction model. The method trains the feature extraction model based on contrastive learning and adversarial learning techniques; the trained model can realize matching and fusion of cross-domain multimodal data, so that the semantic differences between multimedia data and commodity data are aligned and cross-domain data comparison and processing become possible.
Fig. 1 shows a processing system for training a feature extraction model according to an embodiment of the present invention. The feature extraction model is the model to be trained: training samples are input into the feature extraction model to generate feature vectors, the contrastive learning training module and the adversarial learning training module perform training based on training objective functions, the feature extraction model is updated according to the training feedback, and through multiple rounds of iterative training the feature vectors output by the model gradually approach the training targets of contrastive learning and adversarial learning.
In the embodiment of the present invention, the multimedia domain and the commodity domain are taken as examples to illustrate how the problem of cross-domain multimodal fusion and matching is solved. The trained feature extraction model is applied to feature extraction on data from these two domains, and correspondingly the training samples also come from these two domains.
Data in both the multimedia domain and the commodity domain can take multimodal forms; in this embodiment, the multimodal data includes image data and text data. In the multimedia domain, the image data may include, for example, frames of short videos uploaded by users, live-broadcast images, and photos on social media such as microblogs, while the text data may be subtitles in videos and text content on social media. Similarly, in the commodity domain, the image data may include videos and pictures introducing a commodity, and the text data may include the commodity title, descriptive content introducing the commodity, and the like.
Although both the commodity domain and the multimedia domain can contain data of the two modalities, image and text, their organization forms differ greatly: the presentation of multimedia data is free and unconstrained, while the organization of commodity data generally follows e-commerce constraints and is more structured. For example, short videos come from users' everyday shooting, with rich visual elements and a strong sense of scene, whereas introduction videos in the commodity domain have simple backgrounds, need to highlight the commodity with intuitive images and clear emphasis, and must meet e-commerce platform requirements such as duration and background standards. In terms of text, text data in the multimedia domain generally follows natural-language habits and may describe users' impressions, while text data in the commodity domain can contain structured, non-natural language; for example, a commodity title stacks many keywords for retrieval hits, and commodity descriptions tend to describe the commodity itself simply and directly. These differences in organization form create a large semantic gap, so that semantic features obtained by simple feature extraction applied to images or texts alone cannot be compared across the two domains.
In the embodiment of the invention, the purpose of training the feature extraction model is to align the extracted features so as to span the semantic gap between the different domains. The input of the feature extraction model is the original data of the commodity domain or the multimedia domain; after extraction by the model, a feature vector corresponding to the original data is generated. This feature vector has the characteristic of cross-domain alignment: whether the original data comes from the commodity domain or the multimedia domain, the feature vectors extracted by the model lie in the same vector space, so comparisons of various correlation degrees and other vector operations can be performed. The original data may include two streams, image data and text data; after processing by the feature extraction model, an image feature vector and a text feature vector are generated, and the feature vectors of the two modalities also have the alignment characteristic, so cross-modality comparison or vector operations can be performed.
As to the structure of the feature extraction model, a model structure that performs feature extraction and fusion on image data and text data is mainly adopted. For image data, a visual feature extraction model, for example a pretrained ResNet (Residual Network), is used for feature extraction. When multiple visual features result (one per image), max pooling may be applied to them, thereby obtaining a unique feature representation of all the image data of a given piece of multimedia data or commodity data, i.e., fusing them into one image feature vector. On the other hand, for text data, feature extraction may similarly be performed with a text feature extraction model, for example a pretrained BERT (Bidirectional Encoder Representations from Transformers) model or an end-to-end machine learning model: each word in the text is mapped to a globally unique id, each id undergoes an embedding operation, and finally all the word embeddings of the input text undergo a pooling operation to obtain a unique feature representation, i.e., the text feature vector. The pooling operation may be, for example, max pooling or attention-based pooling.
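The per-modality fusion described above can be sketched as follows. This is a hedged illustration only: the pretrained ResNet/BERT backbones are replaced by precomputed NumPy arrays, and the function names are hypothetical.

```python
import numpy as np

def fuse_image_features(per_image_feats):
    """Max-pool per-image feature vectors (stand-ins for pretrained
    ResNet outputs) into a single image feature vector."""
    return np.max(per_image_feats, axis=0)

def fuse_text_features(word_embeddings):
    """Pool per-word embeddings (stand-ins for BERT outputs) into a
    single text feature vector, here via max pooling."""
    return np.max(word_embeddings, axis=0)

imgs = np.array([[0.2, 0.9], [0.8, 0.1]])   # two images of one item
words = np.array([[0.5, 0.0], [0.1, 0.6]])  # two word embeddings
v = fuse_image_features(imgs)   # fused image feature vector
t = fuse_text_features(words)   # fused text feature vector
```

Attention-based pooling would replace the `np.max` with a learned weighted sum, but the fusion interface stays the same: many per-item vectors in, one feature vector out.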
In the example shown in fig. 1, commodity data is denoted p and multimedia data s. After feature extraction by the feature extraction model, the image feature vector corresponding to the commodity data is denoted v_p, the text feature vector corresponding to the commodity data t_p, the image feature vector corresponding to the multimedia data v_s, and the text feature vector corresponding to the multimedia data t_s.
The model training process can be divided into contrastive learning and adversarial learning. In the contrastive learning process, two sample pairs from the training sample set are selected as a contrast learning group. Each sample pair contains multimedia data and commodity data with strong content correlation, while the content correlation between the two sample pairs in the group is weak.
Each sample pair may be generated from commodity data mounted in existing multimedia data (where the correlation between the commodity and the multimedia has already been confirmed); for example, if a certain commodity is mounted in a certain short video, the multimedia data of that short video and the commodity data of that commodity form a sample pair. The two sample pairs of a contrast learning group can be obtained by random selection from the training sample set. It should be noted that multiple commodities mounted on a single video can each form a sample pair with that video, but since all such sample pairs relate to the same video, their contents are strongly correlated and they cannot form a contrast learning group. Therefore, when forming a contrast learning group, the two selected sample pairs should exclude the situation of being mounted on the same video. Of course, the sample pairs and contrast learning groups may also be generated by manual or automated screening, as long as the basic principle is satisfied that contents within a sample pair are strongly correlated and contents between sample pairs are weakly correlated.
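The group-construction rule above can be sketched as follows. This is an illustrative reading only; `build_contrast_groups` and the record format are hypothetical names, not the patent's implementation.

```python
import random

def build_contrast_groups(mount_records, n_groups, seed=0):
    """mount_records: (video_id, multimedia_sample, commodity_sample)
    triples derived from existing commodity mounts. Randomly draw two
    sample pairs per group, skipping draws where both pairs come from
    the same video (their contents would be strongly correlated)."""
    rng = random.Random(seed)
    groups = []
    while len(groups) < n_groups:
        a, b = rng.sample(mount_records, 2)
        if a[0] == b[0]:  # same video: cannot form a contrast group
            continue
        groups.append((a[1:], b[1:]))  # (positive pair, negative pair)
    return groups

records = [("vid1", "s1", "p1"), ("vid1", "s2", "p2"), ("vid2", "s3", "p3")]
groups = build_contrast_groups(records, n_groups=2)
```

Random selection keeps the probability high that the two drawn pairs are weakly correlated; the same-video check excludes the one case where strong correlation is known in advance.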
A specific contrastive learning process is described below. For any given positive sample pair {s+, p+} consisting of a multimedia data sample and a commodity data sample related in content, another sample pair {s-, p-}, likewise consisting of a content-related multimedia data sample and commodity data sample, is randomly selected as the negative sample pair from the whole content pool serving as the training set, forming a contrast learning group. It should be noted that, as a training strategy, the purpose of random selection is to ensure a high probability of weak correlation between the two selected sample pairs. The superscripts "+" and "-" mainly indicate that the contents of the two sample pairs are weakly correlated with each other and form a contrast relationship.
For the selected contrast learning group, the feature extraction model being trained performs feature extraction to generate two sets of multimodal feature vectors, (v_s+, t_s+, v_p+, t_p+) and (v_s-, t_s-, v_p-, t_p-), where v_s+, t_s+, v_p+ and t_p+ respectively denote the image feature vector of the multimedia data, the text feature vector of the multimedia data, the image feature vector of the commodity data and the text feature vector of the commodity data extracted from the positive sample pair; similarly, v_s-, t_s-, v_p- and t_p- denote the corresponding vectors extracted from the negative sample pair.
The learning process of the contrastive training is performed based on the feature vectors described above. The learning aims to minimize the distance between the cross-domain multi-modal vectors within each sample pair, while maximizing the cross-domain distances between the two sample pairs. The distance between vectors can be measured by the Euclidean distance, i.e., $d(x, y) = \|x - y\|_2$, where $x$ and $y$ represent any two vectors from the above feature vector sets.
For the visual modality, i.e., for the image feature vectors in the positive and negative sample pairs described above, the following dual-domain feature constraint can be employed as the loss function in terms of image features:

$$\mathcal{L}_v = f_h\big(d(v_s^+, v_p^+),\, d(v_s^+, v_p^-)\big) + f_h\big(d(v_s^+, v_p^+),\, d(v_p^+, v_s^-)\big) + f_h\big(d(v_s^-, v_p^-),\, d(v_s^-, v_p^+)\big) + f_h\big(d(v_s^-, v_p^-),\, d(v_p^-, v_s^+)\big) \tag{1}$$

where the function $f_h$ is defined as $f_h(x_1, x_2) = \max(0, \mu + x_1 - x_2)$, and $\mu$ is a margin, a positive number or 0, whose specific value can be set according to actual needs; it mainly adjusts the expected difference between $x_1$ and $x_2$, giving the model a certain tolerance. Formula (1) sums four $f_h$ terms, and each $f_h$ involves the distances of two vector pairings: the first argument is a cross-domain vector distance inside a sample pair, and the second is a cross-domain vector distance between the two sample pairs. Taking $f_h\big(d(v_s^+, v_p^+), d(v_s^+, v_p^-)\big)$ as an example, $d(v_s^+, v_p^+)$ is the distance between the image feature vectors of the multimedia data and the commodity data in the positive sample pair, and $d(v_s^+, v_p^-)$ is the distance between the image feature vector of the multimedia data in the positive sample pair and the image feature vector of the commodity data in the negative sample pair; from the training target it is desirable to minimize $d(v_s^+, v_p^+)$ while maximizing $d(v_s^+, v_p^-)$, i.e., to reduce the cross-domain vector distance inside a sample pair and expand the cross-domain vector distance between the sample pairs. In terms of image features, the training target is to minimize the value of the loss function $\mathcal{L}_v$, ideally to 0. Similarly, the following dual-domain feature constraint can be determined as the loss function in terms of text features:

$$\mathcal{L}_t = f_h\big(d(t_s^+, t_p^+),\, d(t_s^+, t_p^-)\big) + f_h\big(d(t_s^+, t_p^+),\, d(t_p^+, t_s^-)\big) + f_h\big(d(t_s^-, t_p^-),\, d(t_s^-, t_p^+)\big) + f_h\big(d(t_s^-, t_p^-),\, d(t_p^-, t_s^+)\big) \tag{2}$$

The definition of each function in formula (2) follows formula (1), except that all feature vectors are replaced by text feature vectors. In terms of text features, the training target is to minimize the value of the loss function $\mathcal{L}_t$, ideally to 0.
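The four-term margin constraint described around formulas (1) and (2) can be sketched numerically as below. This is an illustrative reading, not the patented code; the margin value and toy vectors are assumptions, and the same function applies unchanged to text feature vectors.

```python
import numpy as np

def f_h(x1, x2, mu=0.2):
    # hinge penalty: nonzero unless the in-pair distance x1 is at least
    # mu smaller than the cross-pair distance x2
    return max(0.0, mu + x1 - x2)

def d(x, y):
    # Euclidean distance between two feature vectors
    return float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))

def dual_domain_loss(vs_p, vp_p, vs_n, vp_n, mu=0.2):
    """Sum of four hinge terms: each pulls an in-pair cross-domain distance
    below a cross-pair cross-domain distance by at least the margin mu."""
    return (f_h(d(vs_p, vp_p), d(vs_p, vp_n), mu)
          + f_h(d(vs_p, vp_p), d(vp_p, vs_n), mu)
          + f_h(d(vs_n, vp_n), d(vs_n, vp_p), mu)
          + f_h(d(vs_n, vp_n), d(vp_n, vs_p), mu))

# well-separated embeddings: both pairs internally tight, far from each other
good = dual_domain_loss([0, 0], [0, 0], [10, 10], [10, 10])
# collapsed embeddings: every distance is 0, each term degenerates to mu
bad = dual_domain_loss([0, 0], [0, 0], [0, 0], [0, 0])
```

When the pairs are internally tight and mutually distant the loss reaches its ideal value 0; when all vectors collapse together the loss is the worst case 4μ.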
Combining the dual-domain constraints of the text features and the image features into an overall feature constraint on the multi-modal features, the overall loss function may be:

$$\mathcal{L}_d = \mathcal{L}_v + \mathcal{L}_t \tag{3}$$

The overall objective of contrastive learning is to minimize the loss function of formula (3), computed over the feature vectors generated by the feature extraction model, by adjusting the model parameters of the feature extraction model through continuous iterative processing. The final training result is as illustrated in the contrastive learning training module in fig. 1: in the resulting text feature vector space and image feature vector space, the distance between cross-domain feature vectors of the same sample pair is relatively small, while the distance between cross-domain feature vectors of different sample pairs is relatively large.
The training process of contrastive learning is described above; the training process of adversarial learning is described below. In the embodiment of the invention, adversarial learning serves as a supplement to the contrastive learning training, with the aim of better overcoming the cross-domain, multi-modal semantic gap. In the adversarial learning process, a discriminator is set up to identify the content domain to which a feature vector belongs, and the training target that the adversarial learning training module imposes on the feature extraction model is that the discriminator cannot accurately discriminate that domain. In other words, the adversarial learning part adds a certain difficulty to the contrastive learning, so as to improve its training effect and make the feature vectors extracted from different content domains better aligned.

Specifically, the adversarial learning training module may consist of two discriminators $D_v$ and $D_t$, where $D_v$ discriminates whether an input image feature vector originates from the multimedia domain or the commodity domain, and analogously $D_t$ discriminates whether an input text feature vector originates from the multimedia domain or the commodity domain. In the example of fig. 1, the training goal of the adversarial learning training module is that the discriminators cannot accurately discriminate whether a feature vector comes from the multimedia domain or the commodity domain. The discriminators $D_v$ and $D_t$ may adopt a model structure combining an MLP (Multi-Layer Perceptron) with a GRL (Gradient Reversal Layer).
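A gradient reversal layer is the identity in the forward pass and flips the gradient sign in the backward pass, which is what lets a single optimizer train the extractor to confuse the discriminator. A minimal numpy sketch follows; the reversal strength and the one-layer sigmoid head standing in for the MLP are illustrative assumptions, not the patent's architecture.

```python
import numpy as np

GRL_LAMBDA = 1.0  # reversal strength (illustrative)

def grl_forward(x):
    return x  # identity: features pass through unchanged

def grl_backward(grad):
    # the gradient flowing back toward the feature extractor is sign-flipped,
    # so minimizing the discriminator loss *maximizes* extractor confusion
    return -GRL_LAMBDA * grad

def discriminator(x, w, b):
    # minimal stand-in for the MLP head: P(commodity domain) via a sigmoid
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

feat = np.array([0.5, -1.0])
p_commodity = discriminator(grl_forward(feat), w=np.array([1.0, 1.0]), b=0.0)
flipped = grl_backward(np.array([0.5, -0.25]))
```

In a full framework the same effect is obtained with a custom autograd function rather than explicit forward/backward calls.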
In the example shown in fig. 1, $z$ represents the sample label value, taking the value 0 or 1, where 0 denotes the multimedia domain and 1 denotes the commodity domain; $\hat{z}$ represents the discrimination value actually output by the discriminator, ranging between 0 and 1, which amounts to the probability that the discriminated feature vector belongs to the commodity domain rather than the multimedia domain. The loss function of adversarial learning can be defined in the following binary cross-entropy form:

$$\mathcal{L}_{adv} = -\big(z \log \hat{z} + (1 - z)\log(1 - \hat{z})\big) \tag{4}$$
Denote the model parameters of the feature extraction model as $\theta_E$ and the model parameters of the discriminator as $\theta_C$. On the basis of contrastive learning, after fusing in adversarial learning, the training process can be expressed as the following min-max (minimax) scheme:

$$\hat{\theta}_E = \arg\min_{\theta_E}\big(\mathcal{L}_d(\theta_E) - \mathcal{L}_{adv}(\hat{\theta}_C)\big)$$
$$\hat{\theta}_C = \arg\max_{\theta_C}\big(\mathcal{L}_d(\hat{\theta}_E) - \mathcal{L}_{adv}(\theta_C)\big) \tag{5}$$

where $\hat{\theta}_E$ and $\hat{\theta}_C$ represent the model parameters obtained by training; on the right-hand side of each equation, $\hat{\theta}_C$ and $\hat{\theta}_E$ denote the fixed model parameters obtained in the previous round of training, while on the left-hand side they denote the model parameters obtained in the current round. $\mathcal{L}_d(\theta_E)$ and $\mathcal{L}_d(\hat{\theta}_E)$ denote the loss function when the parenthesized model parameters are used for the feature extraction model, i.e., the feature vectors extracted under those parameters are substituted into formula (3) for calculation. Likewise, $\mathcal{L}_{adv}(\theta_C)$ and $\mathcal{L}_{adv}(\hat{\theta}_C)$ denote the loss function when the parenthesized model parameters are used for the discriminator, i.e., $\hat{z}$ and $z$ are determined under those parameters and substituted into formula (4) for calculation.
In the training process shown in formula (5), the first-row equation is used to train the model parameters of the feature extraction model, and the training objective is to minimize $\mathcal{L}_d(\theta_E) - \mathcal{L}_{adv}(\hat{\theta}_C)$, where the term $\mathcal{L}_{adv}(\hat{\theta}_C)$ on the right-hand side does not change within this round and is treated as a fixed value; the objective is achieved by adjusting the model parameters of the feature extraction model during training. Since $\mathcal{L}_{adv}(\hat{\theta}_C)$ is fixed during this round, the smaller $\mathcal{L}_d(\theta_E)$ becomes, the smaller the overall objective becomes. This part of the training is therefore mainly the contrastive learning training, i.e., improving the cross-domain feature extraction capability of the feature extraction model and achieving alignment of the cross-domain feature vectors, so that the discriminator cannot accurately identify the domain a feature vector comes from.
The second-row equation in formula (5) is used to train the model parameters of the discriminators, with the goal of maximizing $\mathcal{L}_d(\hat{\theta}_E) - \mathcal{L}_{adv}(\theta_C)$, where the term $\mathcal{L}_d(\hat{\theta}_E)$ does not change within this round and is treated as a fixed value; the objective is achieved by adjusting the model parameters of the discriminator during training. Since $\mathcal{L}_d(\hat{\theta}_E)$ is fixed during this round, the smaller $\mathcal{L}_{adv}(\theta_C)$ becomes, the larger the overall objective becomes. This part of the training is therefore mainly directed at the discriminator, i.e., improving the discriminator's ability to tell which domain a feature vector comes from, thereby adding difficulty to the contrastive learning.
In the model training process based on formula (5), the training steps corresponding to the first row and the second row are performed alternately, and the model parameters of the feature extraction model and of the discriminator change continuously, thereby realizing the fusion of contrastive learning and adversarial learning.
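The alternation between the two rows of formula (5) can be skeletonized as below. The scalar toy losses (a quadratic standing in for $\mathcal{L}_d$ and a tracking term standing in for $\mathcal{L}_{adv}$), the learning rate, and the round count are illustrative assumptions; only the update order mirrors the scheme.

```python
def alternating_training(theta_E, theta_C, rounds=50, lr=0.1):
    """Alternate the two rows of the min-max scheme: update the extractor
    with the discriminator frozen, then the discriminator with the extractor
    frozen.  Toy losses: L_d = (theta_E - 3)^2, L_adv = (theta_C - theta_E)^2."""
    schedule = []
    for _ in range(rounds):
        # row 1: min over theta_E of L_d(theta_E) - L_adv(fixed theta_C)
        # (in this scalar toy the frozen L_adv term is constant w.r.t. theta_E)
        theta_E -= lr * 2 * (theta_E - 3)
        schedule.append("E")
        # row 2: max over theta_C of -L_adv(theta_C), i.e. min L_adv(theta_C)
        theta_C -= lr * 2 * (theta_C - theta_E)
        schedule.append("C")
    return theta_E, theta_C, schedule

theta_E, theta_C, schedule = alternating_training(0.0, 5.0)
```

The extractor parameter converges to its loss minimum while the discriminator parameter chases the current extractor, mimicking the "previous round fixed, current round updated" structure of formula (5).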
The feature extraction model trained in the above manner, whether by contrastive learning alone or by the combination of contrastive and adversarial learning, can be applied to mounting commodities on multimedia, recommending commodities to users, commodity retrieval and the like. Fig. 2 is a schematic view of an application scenario of the feature extraction model according to the embodiment of the present invention. The feature extraction model of the embodiment of the invention can be applied to any platform providing a content service; an e-commerce platform is taken as the example in the figure. In the embodiment of the invention, the feature extraction model can be configured in the retrieval/recommendation engine, so that feature extraction is carried out while retrieval and recommendation processing are performed on commodity data or multimedia data, realizing cross-domain feature processing. In addition, the e-commerce platform can pre-process the database in which the commodity data and multimedia data are stored to form a corresponding vector database, facilitating the subsequent comparison processing during retrieval and recommendation. Furthermore, in live-broadcast processing, feature extraction may be performed on the live video based on the feature extraction model provided in the embodiment of the present invention, and matching commodity data may be acquired through the search engine for mounting. The application of the feature extraction model to live-broadcast services and retrieval/recommendation services is described in detail below.
As shown in fig. 3, which is one of the flow diagrams of the processing method for commodity mounting according to the embodiment of the present invention, and fig. 4, which is an application scenario diagram of that method, the method may be applied to an e-commerce platform, a social platform with a multimedia publishing function, a live-broadcast platform and the like, and is used for mounting commodity data having a certain relevance onto multimedia data. Specifically, the method may include:
s101: acquiring multimedia data, performing cross-domain feature extraction on the multimedia data, and generating a multimedia feature vector corresponding to the multimedia data. Wherein the cross-domain feature extraction enables cross-domain alignment of the generated feature vectors. The multimedia data in the embodiment of the invention can comprise videos published by video websites, short videos published by a social platform, live videos of a live platform and the like, and the application scene of the embodiment can search suitable mounted commodities for the videos when the videos are published, and can also match the published videos with the suitable mounted commodities. The above-mentioned cross-domain feature extraction can be implemented by using a feature extraction model obtained based on the training method described above. In this embodiment, the cross-domain mainly refers to a cross-multimedia domain and a commodity domain, and as described above, the difference between the content organization forms of the two domains is large, and by the cross-domain feature extraction processing provided by the embodiment of the present invention, feature vectors extracted for the two content domains can be in the same vector space, and can be subjected to comparison or fusion based on semantic relevance, that is, the feature of cross-domain alignment is provided.
S102: and acquiring a plurality of candidate commodity data, and performing cross-domain feature extraction on the candidate commodity data to generate a plurality of commodity feature vectors. The candidate commodity data may be determined by a user selecting an operation in the commodity database, for example, after the user takes a short video, the short video is uploaded to a social platform or an e-commerce platform, and some commodities desired to be mounted are selected, in this case, the commodity selected by the user is not necessarily matched with the current video, and may be the commodity mounted at will only for increasing the click rate, so that a certain examination needs to be performed on one side of the platform to avoid a situation that a large number of mounted commodities are not matched with the video. In addition, the candidate commodity data may also be recommendations from some merchants or recommendations based on platform advertisement strategies, and the like, and then further screening is performed by using the method of the embodiment. The cross-domain feature extraction in this step can also be implemented by using a feature extraction model obtained based on the training method described above, and the extracted commodity feature vector can be aligned with the multimedia feature vector extracted in the previous step, so that the correlation between vectors can be calculated.
S103: and determining commodity data for mounting from the candidate commodity data according to the correlation degree between the plurality of commodity feature vectors and the multimedia feature vector. After the commodity feature vectors and the multimedia feature vectors are extracted, the relevancy can be determined by calculating the distance between the vectors, and the relevancy between the vectors represents the relevancy between the candidate commodity data and the multimedia data, so that commodity data with the relevancy ranking higher can be selected from the candidate commodity data for mounting.
In connection with the example interface of fig. 4, an interactive interface for video uploading or live broadcasting by the user may be included on the user terminal, and a functional area for selecting mounted commodities may be included on the lower portion, and the user may select from commodity items as candidate commodities in the lower portion and add the commodity items to the video screen, for example, the commodity items may be added to the area at the lower left of the video screen, so that when the viewer watches the video or the live broadcasting, the commodity data may be accessed by clicking a commodity thumbnail displayed in the area at the lower left. The function control area in the upper interactive interface can be used for adjusting the video picture and the thumbnail display position of the mounted commodity and the like. The commodity items in the mounted commodity selection area can come from the selection of the user, also can come from some recommendations of platform merchants and the like. After selecting some commodities as candidate commodity items, the user can mount the commodities in the video picture only after the commodities are processed by the method, namely the commodities need to be screened from the candidate commodities according to the relevance and then mounted.
The above-mentioned commodity feature vector and/or multimedia feature vector may include an image feature vector extracted based on image data and/or a text feature vector extracted based on text data. Specifically, the candidate commodity data and the multimedia data may further include data of two modalities, namely image data and text data, where the image data may include pictures, videos, and the like, the text data may include text contents, titles, subtitles, and the like, and in addition, the audio data may be pre-converted into text data. The data of the two modes can be subjected to cross-domain feature extraction processing respectively to generate an image feature vector and a text feature vector. The image characteristic vectors and the text characteristic vectors corresponding to the commodity data and the multimedia data can be respectively and correspondingly subjected to relevance calculation of the characteristic vectors, and can also be subjected to cross relevance calculation. As described above, the feature extraction model according to the embodiment of the present invention may be based on a pre-trained feature extraction model with cross-modal feature extraction capability, and therefore, the correlation between the extracted image feature vector and the text-based data may also be calculated, so as to implement cross-domain feature comparison on a cross-modal basis.
The commodity mounting processing method provided by the embodiment of the invention generates, based on cross-domain feature extraction processing, commodity feature vectors corresponding to the commodity domain and multimedia feature vectors corresponding to the multimedia domain, and performs cross-domain feature comparison, thereby screening the candidate commodity data and mounting commodity data that better matches the multimedia data onto the multimedia data, so that a user can obtain more valuable commodity information when browsing videos, live broadcasts and other multimedia content.
As shown in fig. 5, which is a second flowchart of a processing method for mounting a commodity according to an embodiment of the present invention, the method may be applied to an e-commerce platform, a social platform with a multimedia publishing function, a live platform, and the like, and is used to mount commodity data with a certain relevance to multimedia data, and specifically, the method may include:
s201: the method comprises the steps of obtaining multimedia data, carrying out cross-domain feature extraction on the multimedia data, and generating multimedia feature vectors corresponding to the multimedia data, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain mode. The processing of this step is identical to step S101 in the previous embodiment.
S202: and retrieving commodity data according to the multimedia feature vectors, and acquiring the commodity data corresponding to the commodity feature vectors with the correlation degree of the multimedia feature vectors larger than a preset threshold value. The embodiment can be applied to scenes such as e-commerce platforms or social platforms, which select commodities for mounting for videos uploaded by users or live broadcasts. One side of the platform can be used for constructing a commodity feature vector database in advance, namely, the existing commodity data in the E-commerce platform uses the feature extraction model provided by the embodiment of the invention to extract commodity feature vectors with cross-domain alignment property in advance and form the commodity feature vector database. Correspondingly, the step may specifically include: and searching in a commodity feature vector database according to the multimedia feature vector to obtain a commodity feature vector of which the correlation degree with the multimedia feature vector is greater than a preset threshold value, and then obtaining corresponding commodity data according to the commodity feature vector.
S203: and mounting the commodity data on the multimedia data. The retrieved commodity data can be directly mounted on the multimedia data, in addition, the retrieved commodity data can be firstly recommended to the user as candidate commodity data, and then the mounting processing of the commodity data is executed according to the selection of the candidate commodity data by the user, so that certain autonomous selectivity is provided for the user.
Similar to the foregoing embodiments, the above-mentioned commodity feature vector and/or multimedia feature vector may include an image feature vector extracted based on image data and/or a text feature vector extracted based on text data. The image characteristic vectors and the text characteristic vectors corresponding to the commodity data and the multimedia data can be respectively and correspondingly subjected to relevance calculation of the characteristic vectors, and can also be subjected to cross relevance calculation.
The commodity mounting processing method provided by the embodiment of the invention generates, based on cross-domain feature extraction processing, commodity feature vectors corresponding to the commodity domain and multimedia feature vectors corresponding to the multimedia domain, and performs cross-domain retrieval based on the feature vectors, thereby acquiring commodity data that better matches the multimedia data and mounting it onto the multimedia data, so that a user can obtain more valuable commodity information when browsing videos, live broadcasts and other multimedia content.
As shown in fig. 6, which is a schematic flow diagram of a recommendation processing method according to an embodiment of the present invention, and as shown in fig. 7, which is a schematic application scenario diagram of the recommendation processing method according to the embodiment of the present invention, the method may be applied to an e-commerce platform, a social platform with a multimedia publishing function, a live platform, and the like, and is used for actively recommending content to a user, and specifically, the method may include:
s301: historical multimedia data and/or historical commodity data of historical access of a user are obtained. In the application scenario of this embodiment, active recommendation may be performed based on the historical access behavior of the user, for example, the user opens a page of an e-commerce APP, and some related content recommendations may be presented to the user on the top page without inputting any keywords. The historical multimedia data and/or the historical commodity data can be used for recording the historical access behaviors of the user under the condition that the user definite permission is obtained through the APP of the user, and are reported to the e-commerce platform so as to provide better content recommendation experience for the user.
S302: performing cross-domain feature extraction on the historical multimedia data and/or the historical commodity data to generate corresponding historical multimedia feature vectors and/or historical commodity feature vectors, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain mode. In this step, the cross-domain feature extraction process is the same as the process of extracting features of the multimedia data and the commodity data mentioned in the foregoing embodiment, and can be implemented by using a feature extraction model obtained based on the training method described above. The historical multimedia data and/or the historical commodity data may include image data and/or text data, and accordingly, the commodity feature vector and/or the multimedia feature vector may include an image feature vector extracted based on the image data and/or a text feature vector extracted based on the text data.
S303: and retrieving multimedia data and/or commodity data according to the historical multimedia feature vectors and/or the historical commodity feature vectors, and acquiring the multimedia data and/or commodity data with the correlation degree between the multimedia feature vectors and/or the commodity feature vectors being larger than a preset threshold value as recommendation data. After the commodity feature vectors and the multimedia feature vectors are extracted, the relevance can be determined by calculating the distance between the vectors, and the relevance between the vectors represents the relevance between the commodity data and the multimedia data, so that the commodity data and the multimedia data which are high in relevance to the historical behaviors of the user can be retrieved, and reasonable content recommendation is performed for the user.
As shown in fig. 7, it shows a recommendation interface of an online shopping APP on the side of a user terminal, and a user can generate various types of recommendation data on the interface by the above recommendation processing method without inputting any information and display the recommendation data to the user on a recommendation page. For example, on the application interface shown in fig. 7, some commodity items are recommended to the user, short videos uploaded by other users and live broadcasts in progress by merchants are also recommended, and associated commodity items are mounted inside both the short videos and the live videos as further recommendations.
In the above-described process of searching for multimedia data and/or product data, the search may be performed in a database in which multimedia feature vectors and/or product feature vectors are stored, which is generated in advance. The platform can pre-extract the feature vectors by using the stored multimedia data and/or commodity data and the feature extraction model provided by the embodiment of the invention to form a feature vector database, and on the basis, the platform can search in the database according to the historical multimedia feature vectors and/or historical commodity feature vectors of the user to obtain the multimedia feature vectors and/or commodity feature vectors of which the correlation degrees with the historical multimedia feature vectors and/or the historical commodity feature vectors are greater than a preset threshold value; and then, according to the obtained multimedia characteristic vector and/or commodity characteristic vector, determining corresponding multimedia data and/or commodity data as recommendation data.
In the historical access behavior of the user, a large amount of historical multimedia data and/or historical commodity data are generally generated, and accordingly, a plurality of historical multimedia feature vectors and/or historical commodity feature vectors extracted from the historical multimedia data and/or the historical commodity data are also generated. In the embodiment of the invention, the retrieval can be respectively carried out based on each multimedia characteristic vector and/or historical commodity characteristic vector, and the multimedia data and/or commodity data with more hit times are used as recommendation data. In addition, a plurality of historical multimedia feature vectors and/or historical commodity feature vectors can be subjected to fusion processing to generate fusion feature vectors, the fusion feature vectors are equivalent to that the semantic abstraction is performed on the overall historical behaviors of the user, then, the multimedia data and/or commodity data are retrieved according to the fusion feature vectors, and the multimedia data and/or commodity data with the correlation degree between the multimedia data and the fusion feature vectors larger than a preset threshold value are obtained.
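The fusion strategy described above — abstracting many historical feature vectors into one query — can be sketched minimally as an L2-normalized mean; this simple pooling is an illustrative assumption, as the patent does not fix a particular fusion operator.

```python
import numpy as np

def fuse_history(history_vecs):
    """Fuse a user's historical feature vectors into one query vector by
    taking the L2-normalized mean, a simple semantic abstraction of the
    overall historical behavior."""
    m = np.mean(np.stack([np.asarray(v, float) for v in history_vecs]), axis=0)
    return m / np.linalg.norm(m)

# two historical interests pointing in different directions
fused = fuse_history([[1.0, 0.0], [0.0, 1.0]])
```

The fused vector can then be used in place of any single historical vector for the threshold retrieval described above; the alternative per-vector "vote counting" strategy simply runs one retrieval per history vector instead.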
The recommendation processing method provided by the embodiment of the invention is based on cross-domain feature extraction processing, performs cross-domain feature extraction on historical multimedia data and/or historical commodity data related to historical access behaviors of a user, and retrieves multimedia data and/or commodity data with better matching from the extracted feature vectors to recommend the multimedia data and/or commodity data to the user, so that the user can obtain better content recommendation experience.
Fig. 8 is a schematic flowchart of a retrieval processing method according to an embodiment of the present invention, and fig. 9 is a schematic application scenario diagram of the retrieval processing method according to an embodiment of the present invention, where the method may be applied to an e-commerce platform for performing content retrieval according to a keyword input by a user, and specifically, the method may include:
s401: and generating a query vector according to the retrieval information input by the user. The application scenario of the embodiment may be a search initiated by a user inputting search information through a search page of the e-commerce platform. For the feature extraction of the search information, the feature extraction model provided in the foregoing embodiment of the present invention may be used for performing the feature extraction.
S402: and according to the query vector, querying in a commodity feature vector database and/or a multimedia feature vector database to obtain the commodity feature vector and/or the multimedia feature vector of which the first correlation degree with the query vector is greater than a preset first threshold value. The commodity feature vectors and/or the multimedia feature vectors in the commodity feature vector database and/or the multimedia feature vector database are obtained based on cross-domain feature extraction, and specifically, the feature extraction model provided by the embodiment of the invention can be adopted, and the generated feature vectors can be aligned in a cross-domain manner through the cross-domain feature extraction. As described above, the e-commerce platform may extract feature vectors from multimedia data and commodity data in advance by using the feature extraction model provided in the embodiment of the present invention, so as to form a feature vector database. On this basis, the search information input by the user is converted into a query vector in step S401, and then the query vector can be compared with the vector in the vector database.
S403: and returning corresponding commodity data and/or multimedia data to the user as a retrieval result according to the obtained commodity feature vector and/or multimedia feature vector. The search result returned to the user may include multimedia data such as short video, ongoing live broadcast, and the like, or may include commodity data distributed on the e-commerce platform. In addition, the multimedia data may further include commodity data, such as a commodity link associated with a short video.
As shown in fig. 9, which shows a retrieval page of the online shopping APP on a user's terminal, the user inputs keywords in the search box of the left-side interface, and after the APP interacts with the server of the e-commerce platform, the right-side page is returned, where the returned retrieval results include items such as commodity items, short videos with commodity items mounted, or merchants' live broadcasts.
For the commodity data mounted in the multimedia data, the e-commerce platform may, when returning the retrieval result to the user, judge the correlation degree between the mounted commodity data and the multimedia data and reject irrelevant commodities. Specifically, the method may further include: acquiring the commodity feature vectors corresponding to the commodity data mounted in the multimedia data; and acquiring a second correlation degree between the multimedia feature vector corresponding to the multimedia data and each commodity feature vector. If the second correlation degree is greater than or equal to a preset second threshold, the mounted commodity data is kept in the retrieval result returned to the user; otherwise, the mounted commodity data is removed.
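The second-correlation filter above can be sketched as follows (cosine similarity is again an assumed choice of correlation measure; names are illustrative):

```python
import numpy as np

def filter_mounted_commodities(media_vec, commodity_vecs, commodity_ids,
                               second_threshold=0.5):
    """Keep a mounted commodity only if the 'second correlation degree'
    between the multimedia feature vector and its commodity feature
    vector reaches the preset second threshold."""
    m = media_vec / np.linalg.norm(media_vec)
    kept = []
    for cid, vec in zip(commodity_ids, commodity_vecs):
        sim = float(np.dot(m, vec / np.linalg.norm(vec)))
        if sim >= second_threshold:
            kept.append(cid)
    return kept
```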
In addition, for the multimedia data with no mounted commodity data or less mounted commodity data in the retrieval result, the related commodity data can be further acquired for mounting and recommended to the user. Specifically, in the case of returning multimedia data to the user, the method may further include: according to the multimedia feature vector corresponding to the multimedia data, inquiring in a commodity feature vector database to obtain a commodity feature vector of which the correlation degree with the multimedia feature vector is greater than a preset third threshold value; and acquiring corresponding commodity data according to the commodity feature vector, and mounting the commodity data into multimedia data as a retrieval result for providing.
According to the retrieval processing method provided by the embodiment of the invention, based on the feature vector database formed by cross-domain feature extraction processing, cross-domain feature vector matching query can be carried out according to the feature vector formed by retrieval information input by a user, and multimedia data and/or commodity data meeting the retrieval requirements of the user can be obtained, so that abundant retrieval results from different content domains can be provided for the user.
In the above embodiment, the application of the feature extraction model and the model training principle of the embodiment of the present invention are described by taking commodity data and multimedia data as two content fields as examples. In fact, the feature extraction model and the model training method of the embodiment of the invention can realize feature extraction of data of a plurality of content domains.
Fig. 10 is a schematic flow chart of a training method of a feature extraction model according to an embodiment of the present invention. The training method of the embodiment of the present invention may be applied to an e-commerce platform, a search engine, or a data service platform providing cloud services, and is used to train the feature extraction model in processing systems such as search and content recommendation, so as to implement feature extraction on data of different content domains and enable the extracted feature vectors to be aligned, thereby facilitating subsequent processing such as feature comparison or fusion between content domains. The method may include:
S501: obtain content-related samples from a plurality of content domains to form a plurality of cross-domain sample combinations. To train the feature extraction model, a sample set may be constructed in advance. A sample may come from data in any content domain, such as a video from the multimedia domain or a piece of commodity data from the commodity domain. If samples from different content domains have a known correlation, they can be combined into a cross-domain sample combination and used for model training. The foregoing example takes a sample pair composed of multimedia data and the commodity data mounted in it as the cross-domain sample combination; in actual applications, since there may be more than two content domains, a cross-domain sample combination may also contain more than two samples.
S502: perform feature extraction on the original data of the samples in the cross-domain sample combinations by using the feature extraction model being trained, to generate the feature vectors corresponding to the samples. The sample data of any content domain may include data of multiple modalities, where a modality refers to the form in which the data itself exists; typical modalities include images, audio, text, and so on. Therefore, the feature extraction processing in this step may include: performing feature extraction on the original data of each modality of the samples in the cross-domain sample combinations, respectively, to generate a plurality of modality feature vectors corresponding to the samples. In the embodiment of the present invention, image data and text data may be selected as the main modalities, and audio data may be converted into text data in advance; that is, the original data of the above samples may include image data and/or text data. Accordingly, the feature extraction processing in this step may include: performing feature extraction on the image data and/or text data of the samples in the cross-domain sample combinations, respectively, to generate image feature vectors and/or text feature vectors corresponding to the samples.
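The per-modality extraction in S502 can be sketched as below. The toy encoders here are hypothetical stand-ins for the pre-trained models the patent mentions (e.g. BERT for text, ResNet for images); only the dispatch structure — one feature vector per modality present in a sample — reflects the step itself:

```python
import numpy as np

def text_encoder(text, dim=8):
    """Toy stand-in for a pre-trained text model (e.g. BERT):
    hashes characters into a fixed-size, L2-normalized vector."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text):
        vec[(ord(ch) + i) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def image_encoder(image, dim=8):
    """Toy stand-in for a pre-trained image model (e.g. ResNet):
    pools pixel statistics into a fixed-size, L2-normalized vector."""
    flat = np.asarray(image, dtype=float).ravel()
    vec = np.array([flat[i::dim].mean() for i in range(dim)])
    return vec / (np.linalg.norm(vec) + 1e-9)

def extract_sample_features(sample):
    """Per-modality feature extraction: a sample may carry image data
    and/or text data; each modality present yields one feature vector."""
    feats = {}
    if "text" in sample:
        feats["text"] = text_encoder(sample["text"])
    if "image" in sample:
        feats["image"] = image_encoder(sample["image"])
    return feats
```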
It should be noted that, in the embodiment of the present invention, the contrastive learning and the adversarial learning described later mainly train the cross-domain feature extraction capability. Where cross-modal feature extraction is also required, the feature extraction model being trained may be initialized from a pre-trained feature extraction model with cross-modal feature extraction capability; for example, in terms of model selection, a pre-trained BERT model, an end-to-end machine learning model, or the like may be adopted for the text feature extraction part, and a pre-trained ResNet model or the like may be adopted for the image feature extraction part.
S503: perform contrastive learning training on the feature extraction model, with the training target of reducing the distance between the feature vectors corresponding to samples within the same cross-domain sample combination and enlarging the distance between the feature vectors corresponding to samples in different cross-domain sample combinations. In the foregoing, the corresponding training process, loss function calculation, and the like have been described with commodity data and multimedia data as examples. For the case of more than two content domains, i.e., where there are multiple samples within a sample combination, the training and the construction of the loss function can also be done in the manner of the previous example. In the loss function, the distance may still be calculated between pairwise vectors, only with more pairwise distance computations than in the two-content-domain case; the overall principle is the same: the distances between feature vectors of different domains within the same sample combination should be as small as possible, while the distances between feature vectors belonging to different sample combinations should be as large as possible. The contrastive learning training process may refer to the formula (2) and the formula (3) mentioned in the foregoing embodiments.
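The patent's formulas (2) and (3) are not reproduced here; as a generic sketch of the same training target, an InfoNCE-style batch loss can be written as follows, where row i of each domain belongs to the same sample combination (the positive pair) and all other pairings are negatives:

```python
import numpy as np

def contrastive_loss(domain_a, domain_b, temperature=0.1):
    """InfoNCE-style contrastive loss over a batch of cross-domain sample
    pairs: pulls feature vectors of the same sample combination together
    and pushes feature vectors of different combinations apart."""
    a = domain_a / np.linalg.norm(domain_a, axis=1, keepdims=True)
    b = domain_b / np.linalg.norm(domain_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature               # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numeric stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # positives on the diagonal
```

With correctly paired batches the loss is low; shuffling one domain (breaking the sample combinations) raises it, which is exactly the gradient signal that aligns the domains.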
In addition, in order to improve the training effect of the contrastive learning, the embodiment of the present invention may introduce adversarial learning as auxiliary training. Accordingly, the above method may further include: performing adversarial learning training on the feature extraction model, with the training target of reducing the accuracy with which the discriminator of the adversarial learning judges the content domain to which a feature vector belongs. In this embodiment, the adversarial learning increases the difficulty of the contrastive learning and serves as a check on its effect: when the discriminator cannot accurately judge which content domain a feature vector extracted by the feature extraction model belongs to, the contrastive learning has achieved good alignment of the feature vectors across content domains, i.e., a better cross-domain feature extraction effect.
Specifically, during the adversarial learning, the following two training procedures may be performed alternately:
1) Fix the model parameters of the discriminator and train the feature extraction model to reduce the accuracy with which the discriminator judges the content domain to which a feature vector belongs. In this part, with the discriminator's parameters held fixed, the feature extraction model is trained in combination with the contrastive learning loss function, so that the cross-domain alignment of the feature vectors it extracts is improved.
2) Fix the model parameters of the feature extraction model and train the discriminator to improve the accuracy with which it judges the content domain to which a feature vector belongs. In this part, with the feature extraction model's parameters held fixed, the discriminator's discrimination capability is improved, which in turn increases the difficulty of the training in 1) above.
The adversarial learning training process may use the functions of the formula (4) and the formula (5) in the foregoing embodiments.
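As a heavily simplified toy illustration of the alternation (not the patent's actual networks or formulas (4)-(5)): each "extractor" is a single per-domain shift on 1-D data, the discriminator is logistic regression, and all data, names, and learning rates are assumptions. The two updates inside the loop correspond to procedures 2) and 1) above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Toy raw data: two content domains whose 1-D features start far apart.
x0 = rng.normal(-2.0, 0.5, 200)   # content domain 0
x1 = rng.normal(+2.0, 0.5, 200)   # content domain 1
shift0 = shift1 = 0.0              # extractor parameters (one shift per domain)
w, b = 1.0, 0.0                    # logistic-regression discriminator
d_lr, g_lr = 0.5, 0.02             # discriminator / extractor learning rates

for step in range(300):
    f0, f1 = x0 + shift0, x1 + shift1
    # 2) fix the extractor, train the discriminator to tell the domains apart
    p0, p1 = sigmoid(w * f0 + b), sigmoid(w * f1 + b)
    w += d_lr * (np.mean((1 - p1) * f1) - np.mean(p0 * f0))
    b += d_lr * (np.mean(1 - p1) - np.mean(p0))
    # 1) fix the discriminator, train the extractor to fool it
    # (non-saturating, flipped-label gradient updates)
    p0, p1 = sigmoid(w * (x0 + shift0) + b), sigmoid(w * (x1 + shift1) + b)
    shift0 += g_lr * np.mean(1 - p0) * w
    shift1 -= g_lr * np.mean(p1) * w
```

At equilibrium the two domains' extracted features overlap, so the discriminator's accuracy drops toward chance, which is the training target of the adversarial phase.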
According to the method, the apparatus, and the electronic device provided by the embodiments of the present invention, the cross-content-domain feature extraction model is trained through contrastive learning, thereby realizing feature extraction on data of different content domains; the extracted feature vectors have the cross-domain alignment property, so that further processing such as feature comparison can be carried out, effectively improving the cross-domain content retrieval and recommendation matching effect.
According to the model training method provided by the embodiment of the present invention, feature extraction on data of different content domains is realized through contrastive learning, or through contrastive learning combined with adversarial learning; the extracted feature vectors have the cross-domain alignment property, so that further processing such as feature comparison can be carried out, effectively improving the cross-domain content retrieval and recommendation matching effect.
The embodiment of the invention also provides a commodity mounting processing method, which can be applied to an e-commerce platform or a social platform, particularly to a software product for providing multimedia data release for users as general consumers, and helps the users to recommend and select commodity mounting, and specifically, the method comprises the following steps:
S601: acquire the multimedia data uploaded by the user. The user in this embodiment may be an ordinary consumer, and the uploaded multimedia data may be a homemade video, for example content about daily life or product use. With the user's permission, some associated commodities may be recommended to the user and mounted in the video as advertisements, from which the user may also obtain a certain reward.
S602: and selecting matched commodity data from a commodity database according to the content characteristics obtained by performing cross-domain characteristic extraction on the multimedia data, and recommending the commodity data to the user. The above-described extraction process for the content features of the multimedia data may use the feature extraction model described in the foregoing embodiments.
S603: and responding to the selection of the commodity data of the user, and mounting the selected commodity data into the multimedia data. The platform can recommend a plurality of commodity data to the user, and the user can finally decide the commodity data mounted in the multimedia data according to the preference or the idea of the user.
In addition, after the multimedia data is published, a large amount of comment data may be received. In the embodiment of the present invention, the comment data may be obtained, and according to its features, commodity data matching the comment data may be selected from the commodity database and recommended to the user, so as to update the commodity data mounted in the multimedia data. This process may be performed periodically as the comment data changes, continuously helping the user update the mounted commodity data.
For example, after a user uploads a video of travel life, the initially mounted commodities are products such as travel clothing. Because the video is well shot, it attracts a great deal of attention, along with a large number of comments about photography. The platform can then match some commodities such as photographic equipment or books according to the new comment data, recommend them to the user, and update the mounted commodities, thereby helping the user and the merchants achieve a win-win effect.
In addition, the comment data may also include some comment data of the product, and as the comment data of the product changes, the matching degree with the multimedia data may be affected, so that the product data recommended to the user may also be updated according to the comment data of the product.
The embodiment of the invention also provides a commodity mounting processing method, which can be applied to an e-commerce platform or a social platform, and particularly can be applied to a software product for providing multimedia data release for a merchant user, for example, to help the merchant user to select or check commodity mounting, and specifically comprises the following steps:
s701: the method comprises the steps of obtaining multimedia data uploaded by a user and a plurality of commodity data to be mounted. In the aspect of the merchant user, the uploaded multimedia data has strong purposiveness, namely the uploaded multimedia data is used for advertising the commodity of the merchant user, but the merchant user is possibly deficient in the matching judgment of the commodity data and the multimedia data.
S702: and acquiring the correlation between the commodity data and the multimedia data according to the content characteristics obtained by performing cross-domain characteristic extraction on the multimedia data and the commodity characteristics obtained by performing cross-domain characteristic extraction on the commodity data. The above-described extraction processing of various features may use the feature extraction model described in the foregoing embodiment, thereby forming cross-domain feature content capable of performing correlation calculation.
S703: recommend commodity data from the plurality of commodity data to be mounted according to the correlation degrees. Specifically, the plurality of commodity data may be ranked by their correlation degree with the multimedia data, and mountable commodity data may be recommended to the user based on the ranking; for example, a certain number of top-ranked commodity data may be selected as mountable commodity data.
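The ranking step in S703 can be sketched as follows (cosine similarity as an assumed correlation measure; function and parameter names are illustrative):

```python
import numpy as np

def rank_candidates(media_vec, candidates, top_k=3):
    """Rank candidate commodities by their correlation degree with the
    multimedia feature vector and return the top-k as mountable
    recommendations."""
    m = media_vec / np.linalg.norm(media_vec)
    scored = []
    for cid, vec in candidates.items():
        sim = float(np.dot(m, vec / np.linalg.norm(vec)))
        scored.append((cid, sim))
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]
```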
After the multimedia data is released, a large amount of comment data may be received. In the embodiment of the present invention, the comment data may be acquired, and the commodity data recommended to the user may be updated according to the features obtained by performing cross-domain feature extraction on the comment data; for example, the user may be advised to replace some mounted commodity data or to recreate the multimedia data. In addition, the comment data may also include comments on the commodities themselves; as such comments change, the matching degree with the multimedia data may be affected, so the commodity data recommended to the user may also be updated according to the comment data of the commodities.
In addition to the correlation between multimedia features and commodity features, which is the main consideration in the commodity mounting recommendation process, the geographic location information of the user may also be introduced when recommending commodity data for mounting. For example, for some agricultural products, merchant users or ordinary consumer users are often concentrated in one area; in this case, based on the characteristics of the area, some commodity data related to the area can be recommended for mounting in the multimedia data, thereby promoting commodity sales in the area.
As shown in fig. 11, which is a schematic structural diagram of a processing apparatus for mounting a commodity according to an embodiment of the present invention, the apparatus may be applied to an e-commerce platform, a social platform with a multimedia publishing function, a live platform, and the like, and is used to mount commodity data with a certain relevance to multimedia data, and specifically, the apparatus may include:
the multimedia feature extraction module 11 is configured to acquire multimedia data, perform cross-domain feature extraction on the multimedia data, and generate a multimedia feature vector corresponding to the multimedia data, where the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner. The multimedia data in the embodiment of the invention can comprise videos published by video websites, short videos published by a social platform, live videos of a live platform and the like, and the application scene of the embodiment can search suitable mounted commodities for the videos when the videos are published, and can also match the published videos with the suitable mounted commodities. The above-mentioned cross-domain feature extraction can be implemented by using a feature extraction model obtained based on the training method described above. In this embodiment, the cross-domain mainly refers to a cross-multimedia domain and a commodity domain, and as described above, the difference between the content organization forms of the two domains is large, and by the cross-domain feature extraction processing provided by the embodiment of the present invention, feature vectors extracted for the two content domains can be in the same vector space, and can be subjected to comparison or fusion based on semantic relevance, that is, the feature of cross-domain alignment is provided.
The commodity feature extraction module 12 is configured to obtain a plurality of candidate commodity data, perform cross-domain feature extraction on the candidate commodity data, and generate a plurality of commodity feature vectors. The candidate commodity data may be determined by the user's selection operations in the commodity database; they may also come from recommendations of some merchants, or from platform advertising strategies, etc., with the apparatus of this embodiment used for further screening of the candidates. The cross-domain feature extraction in this process can also be implemented by using a feature extraction model obtained with the training method described above; the extracted commodity feature vectors and multimedia feature vectors can be aligned, and the correlation between them can be calculated.
And a correlation processing module 13, configured to determine commodity data to be mounted from the plurality of candidate commodity data according to a correlation between the plurality of commodity feature vectors and the multimedia feature vector. After the commodity feature vectors and the multimedia feature vectors are extracted, the relevancy can be determined by calculating the distance between the vectors, and the relevancy between the vectors represents the relevancy between the candidate commodity data and the multimedia data, so that commodity data with the relevancy ranking higher can be selected from the candidate commodity data for mounting.
The above-mentioned commodity feature vectors and/or multimedia feature vectors may include image feature vectors extracted from image data and/or text feature vectors extracted from text data. The correlation calculation may be performed between corresponding modalities of the commodity data and the multimedia data (image to image, text to text), or across modalities. As described above, the feature extraction model of the embodiment of the present invention may be based on a pre-trained model with cross-modal feature extraction capability; therefore, the correlation between an extracted image feature vector and a text feature vector may also be calculated, so as to implement cross-domain feature comparison on a cross-modal basis.
The commodity mounting processing apparatus provided by the embodiment of the present invention generates commodity feature vectors for the commodity domain and multimedia feature vectors for the multimedia domain based on cross-domain feature extraction, performs cross-domain feature comparison to screen the candidate commodity data, and mounts the commodity data that best matches the multimedia data onto the multimedia data, so that the user can obtain more valuable commodity information when browsing multimedia content such as videos or live broadcasts.
As shown in fig. 12, which is a second schematic structural diagram of a processing apparatus for mounting a commodity according to an embodiment of the present invention, the apparatus may be applied to an e-commerce platform, a social platform with a multimedia publishing function, a live platform, and the like, and is used to mount commodity data with a certain relevance to multimedia data, and specifically, the apparatus may include:
the multimedia feature extraction module 21 is configured to acquire multimedia data, perform cross-domain feature extraction on the multimedia data, and generate a multimedia feature vector corresponding to the multimedia data, where the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner. The processing of this step is identical to the multimedia feature extraction module 11 in the previous embodiment.
The commodity retrieval processing module 22 is configured to perform commodity data retrieval according to the multimedia feature vector, and acquire the commodity data corresponding to commodity feature vectors whose correlation degree with the multimedia feature vector is greater than a preset threshold. This embodiment can be applied to scenarios in which an e-commerce platform or a social platform selects commodities to mount on videos or live broadcasts uploaded by users. On the platform side, a commodity feature vector database can be constructed in advance: commodity feature vectors with the cross-domain alignment property are extracted in advance from the existing commodity data on the e-commerce platform by using the feature extraction model provided by the embodiment of the present invention. Accordingly, the processing of this module may specifically include: searching the commodity feature vector database according to the multimedia feature vector to obtain commodity feature vectors whose correlation degree with the multimedia feature vector is greater than the preset threshold, and then obtaining the corresponding commodity data according to those commodity feature vectors.
And the mounting processing module 23 is configured to mount the commodity data on the multimedia data. In addition, the retrieved commodity data can be recommended to the user as candidate commodity data, and then the mounting processing of the commodity data can be executed according to the selection of the candidate commodity data by the user, so that the user can select autonomously.
Similar to the foregoing embodiments, the above-mentioned commodity feature vector and/or multimedia feature vector may include an image feature vector extracted based on image data and/or a text feature vector extracted based on text data. The image characteristic vectors and the text characteristic vectors corresponding to the commodity data and the multimedia data can be respectively and correspondingly subjected to relevance calculation of the characteristic vectors, and can also be subjected to cross relevance calculation.
The commodity mounting processing apparatus generates commodity feature vectors for the commodity domain and multimedia feature vectors for the multimedia domain based on cross-domain feature extraction, and performs cross-domain retrieval based on these feature vectors, so as to obtain commodity data that matches the multimedia data well and mount it onto the multimedia data, so that the user can obtain more valuable commodity information when browsing multimedia content such as videos or live broadcasts.
As shown in fig. 13, which is a schematic structural diagram of a recommendation processing apparatus according to an embodiment of the present invention, the apparatus may be applied to an e-commerce platform, a social platform with a multimedia publishing function, a live platform, and the like, and is used to actively recommend content to a user, and specifically, the apparatus may include:
a historical data acquiring module 31, configured to acquire historical multimedia data and/or historical commodity data accessed by a user in a historical manner. In the application scenario of this embodiment, active recommendation may be performed based on the historical access behavior of the user, for example, the user opens a page of an e-commerce APP, and some related content recommendations may be presented to the user on the top page without inputting any keywords. The historical multimedia data and/or the historical commodity data can be used for recording the historical access behaviors of the user under the condition that the user definite permission is obtained through the APP of the user, and are reported to the e-commerce platform so as to provide better content recommendation experience for the user.
And the historical data feature extraction module 32 is configured to perform cross-domain feature extraction on the historical multimedia data and/or the historical commodity data to generate corresponding historical multimedia feature vectors and/or historical commodity feature vectors, where the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner.
The cross-domain feature extraction process is the same as the process of extracting features of multimedia data and commodity data mentioned in the foregoing embodiment, and can be implemented by using a feature extraction model obtained based on the training method described above. The historical multimedia data and/or the historical commodity data may include image data and/or text data, and accordingly, the commodity feature vector and/or the multimedia feature vector may include an image feature vector extracted based on the image data and/or a text feature vector extracted based on the text data.
The data retrieval processing module 33 is configured to retrieve multimedia data and/or commodity data according to the historical multimedia feature vectors and/or historical commodity feature vectors, and acquire, as recommendation data, the multimedia data and/or commodity data whose correlation degree with the historical multimedia feature vectors and/or historical commodity feature vectors is greater than a preset threshold. After the commodity feature vectors and the multimedia feature vectors are extracted, the correlation degree can be determined by calculating the distance between the vectors; this correlation represents the relevance of the commodity data or multimedia data to the user's historical behavior, so commodity data and multimedia data highly relevant to the user's history can be retrieved, and reasonable content recommendations can be made for the user.
In the above-described process of searching for multimedia data and/or product data, the search may be performed in a database in which multimedia feature vectors and/or product feature vectors are stored, which is generated in advance. The platform can pre-extract the feature vectors by using the stored multimedia data and/or commodity data and the feature extraction model provided by the embodiment of the invention to form a feature vector database, and on the basis, the platform can search in the database according to the historical multimedia feature vectors and/or historical commodity feature vectors of the user to obtain the multimedia feature vectors and/or commodity feature vectors of which the correlation degrees with the historical multimedia feature vectors and/or the historical commodity feature vectors are greater than a preset threshold value; and then, according to the obtained multimedia characteristic vector and/or commodity characteristic vector, determining corresponding multimedia data and/or commodity data as recommendation data.
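The history-driven recommendation query above can be sketched as follows. Aggregating the user's historical feature vectors into a single profile vector by a simple mean is one possible (assumed) choice; the patent only requires that the query run against the pre-generated feature vector database:

```python
import numpy as np

def recommend_from_history(history_vecs, db_vectors, db_ids, threshold=0.6):
    """Aggregate a user's historical feature vectors into a profile
    vector, then query the precomputed feature vector database for
    items whose correlation degree exceeds the preset threshold,
    best matches first."""
    profile = np.mean(history_vecs, axis=0)
    profile /= np.linalg.norm(profile)
    db = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
    sims = db @ profile
    order = np.argsort(-sims)
    return [(db_ids[i], float(sims[i])) for i in order if sims[i] > threshold]
```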
The recommendation processing device provided by the embodiment of the invention performs cross-domain feature extraction on historical multimedia data and/or historical commodity data related to historical access behaviors of a user based on cross-domain feature extraction processing, retrieves the multimedia data and/or commodity data with better matching from the extracted feature vector and recommends the multimedia data and/or commodity data to the user, so that the user can obtain better content recommendation experience.
As shown in fig. 14, which is a schematic structural diagram of a retrieval processing apparatus according to an embodiment of the present invention, the apparatus may be applied to an e-commerce platform for performing content retrieval according to a keyword input by a user, and specifically, the apparatus may include:
and a query vector generation module 41, configured to generate a query vector according to the search information input by the user. The application scenario of the embodiment may be a search initiated by a user inputting search information through a search page of the e-commerce platform. For the feature extraction of the search information, the feature extraction model provided in the foregoing embodiment of the present invention may be used for performing the feature extraction.
And the query vector retrieval module 42 is configured to perform query in the commodity feature vector database and/or the multimedia feature vector database according to the query vector, and obtain a commodity feature vector and/or a multimedia feature vector of which a first correlation degree with the query vector is greater than a preset first threshold, where the commodity feature vector and/or the multimedia feature vector in the commodity feature vector database and/or the multimedia feature vector database are obtained based on cross-domain feature extraction, and the cross-domain feature extraction enables the generated feature vectors to be aligned across domains. As described above, the e-commerce platform may extract feature vectors from multimedia data and commodity data in advance by using the feature extraction model provided in the embodiment of the present invention, so as to form a feature vector database. On the basis, the retrieval information input by the user is converted into a query vector, and then the query vector can be compared with the vector in the vector database.
And the retrieval result feedback module 43 is configured to return corresponding commodity data and/or multimedia data to the user as a retrieval result according to the obtained commodity feature vector and/or multimedia feature vector. The search result returned to the user may include multimedia data, such as short videos and ongoing live broadcasts, and may also include commodity data published on the e-commerce platform. In addition, the multimedia data may itself carry mounted commodity data, such as a commodity link associated with a short video.
For the commodity data mounted in the multimedia data, the e-commerce platform can judge the correlation degree between the mounted commodity data and the multimedia data when returning a retrieval result to the user, and reject irrelevant commodities. In addition, for multimedia data in the retrieval result with no or few mounted commodity data, related commodity data can be further acquired, mounted, and recommended to the user.
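The relevance check on already-mounted commodities can be sketched as a threshold filter over cosine similarity between feature vectors. The function names, data layout, and threshold value below are illustrative assumptions, not the patent's concrete implementation:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_mounted_goods(media_vec, mounted, second_threshold=0.5):
    """Keep a mounted commodity only if its feature vector is sufficiently
    correlated with the multimedia feature vector; otherwise reject it."""
    return [good_id for good_id, good_vec in mounted
            if cosine(media_vec, good_vec) >= second_threshold]

media = np.array([1.0, 0.0])
mounted = [("relevant_good", np.array([0.9, 0.1])),
           ("irrelevant_good", np.array([0.0, 1.0]))]
print(filter_mounted_goods(media, mounted))  # → ['relevant_good']
```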
The retrieval processing device provided by the embodiment of the invention can perform cross-domain matching query on the feature vector database formed by cross-domain feature extraction processing according to the feature vector formed by retrieval information input by a user, and acquire multimedia data and/or commodity data meeting the retrieval requirements of the user, thereby providing rich retrieval results from different content domains for the user.
As shown in fig. 15, which is a schematic structural diagram of a training device for a feature extraction model according to an embodiment of the present invention, the training device may be applied to an e-commerce platform, a search engine, or a data service platform providing cloud services, and is configured to train the feature extraction model in a processing system such as a search engine or a content recommendation system. It implements feature extraction on data from different content domains and enables the extracted feature vectors to be aligned, thereby facilitating subsequent processing such as feature comparison or fusion across content domains. The device may include:
the sample obtaining module 51 is configured to obtain a plurality of cross-domain sample combinations composed of samples related to content in a plurality of content domains. To train the feature extraction model, a sample set may be constructed in advance. Wherein, the sample can be from data in any one content domain, such as a certain video from a multimedia domain or a certain commodity data from a commodity domain, etc. If there is a known correlation between these samples from different content domains, then these samples can be combined into a cross-domain sample combination and the user performs model training. In the foregoing example, the sample pair composed of the commodity data mounted in the multimedia data is taken as an example of the cross-domain sample combination, and in an actual application, since there may be a plurality of content domains, the number of samples in the cross-domain sample combination may also be a plurality.
And a feature vector generation module 52, configured to perform feature extraction on the raw data of the samples in the cross-domain sample combination by using the feature extraction model under training, generating a feature vector for each sample. The sample data of any content domain may include data of multiple modalities, where a modality refers to the form in which the data exists, such as image, audio, or text. Thus, the feature extraction process may include: extracting features from the raw data of each modality of the samples in the cross-domain sample combination, generating a plurality of modal feature vectors for each sample. In an embodiment of the present invention, image data and text data may be selected as the main modalities, with audio data converted into text data in advance; that is, the raw data of a sample may include image data and/or text data, and accordingly the feature extraction process may include: extracting features from the image data and/or text data of the samples in the cross-domain sample combination, generating image feature vectors and/or text feature vectors corresponding to the samples.
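The per-modality extraction can be sketched as routing each modality of a sample's raw data through its own encoder. The stub encoders below merely produce deterministic vectors of a shared dimension so the sketch runs; in practice they would be pre-trained image and text encoders, and the 'image'/'text' keys are an assumed data layout:

```python
import hashlib
import numpy as np

DIM = 8  # shared embedding dimension (illustrative)

def _stub_encoder(raw, salt):
    """Stand-in for a real encoder: deterministically maps raw data to a
    DIM-dimensional vector. Only the interface matters for this sketch."""
    seed = int(hashlib.md5((salt + raw).encode()).hexdigest()[:8], 16)
    return np.random.default_rng(seed).standard_normal(DIM)

def extract_modal_feature_vectors(sample):
    """Extract one feature vector per modality present in the sample's
    raw data, yielding the plurality of modal feature vectors."""
    vectors = {}
    if "image" in sample:
        vectors["image"] = _stub_encoder(sample["image"], "img:")
    if "text" in sample:
        vectors["text"] = _stub_encoder(sample["text"], "txt:")
    return vectors

vecs = extract_modal_feature_vectors({"image": "frame_001.png",
                                      "text": "red summer dress"})
print(sorted(vecs), vecs["text"].shape)  # → ['image', 'text'] (8,)
```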
It should be noted that, in the embodiment of the present invention, the contrastive learning and adversarial learning described later mainly train the cross-domain feature extraction capability. Where cross-modal feature extraction is also required, the feature extraction model under training may adopt a pre-trained feature extraction model with cross-modal feature extraction capability; for example, a pre-trained BERT model, an end-to-end machine learning model, or the like may be adopted for the text feature extraction portion, and a pre-trained ResNet model or the like for the image feature extraction portion.
The training module 53 is configured to perform contrastive learning training on the feature extraction model, with the training target of reducing the distance between feature vectors corresponding to samples in the same cross-domain sample combination and enlarging the distance between feature vectors corresponding to samples in different cross-domain sample combinations.
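A common concrete form of this training target is an InfoNCE-style contrastive loss. The sketch below assumes row i of each matrix comes from the same cross-domain sample combination (the positive pair) and treats all other rows as negatives; the patent does not fix a specific loss, so this is only an illustration:

```python
import numpy as np

def contrastive_loss(media_vecs, goods_vecs, temperature=0.1):
    """InfoNCE-style loss: minimizing it pulls each positive pair
    (matching rows) together and pushes mismatched pairs apart."""
    m = media_vecs / np.linalg.norm(media_vecs, axis=1, keepdims=True)
    g = goods_vecs / np.linalg.norm(goods_vecs, axis=1, keepdims=True)
    logits = (m @ g.T) / temperature                  # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # -log p(positive)

media = np.eye(4)
aligned = np.eye(4)                         # positives match -> low loss
mismatched = np.roll(np.eye(4), 1, axis=0)  # positives shuffled -> high loss
print(contrastive_loss(media, aligned) < contrastive_loss(media, mismatched))  # → True
```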
In addition, in order to improve the training effect of contrastive learning, the embodiment of the invention may introduce adversarial learning for auxiliary training. Thus, the training module 53 may also be configured to: perform adversarial learning training on the feature extraction model, with the training target of reducing the accuracy with which the adversarial discriminator identifies the content domain to which a feature vector belongs. In this embodiment, the adversarial learning increases the difficulty of the contrastive learning and serves as a check on its effect: when the discriminator cannot accurately judge which content domain a feature vector extracted by the feature extraction model belongs to, the contrastive learning has aligned the feature vectors of different content domains well, achieving a better cross-domain feature extraction effect.
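The intended effect of the adversarial objective can be illustrated with a toy example: before alignment, a fixed linear discriminator separates the two content domains almost perfectly; after a domain-aligning update of the extractor's output, its accuracy collapses toward chance. The alignment step here (subtracting each domain's mean) is only a stand-in for what gradient-based adversarial training would learn:

```python
import numpy as np

rng = np.random.default_rng(0)

def disc_accuracy(feats, domains):
    """A fixed linear discriminator reading the domain off the first
    feature dimension: its accuracy measures domain separability."""
    preds = (feats[:, 0] > 0).astype(int)
    return float((preds == domains).mean())

# Toy extractor outputs: multimedia-domain features offset from
# commodity-domain features, so the domains start trivially separable.
media = rng.normal(loc=2.0, size=(100, 4))
goods = rng.normal(loc=-2.0, size=(100, 4))
domains = np.array([1] * 100 + [0] * 100)
acc_before = disc_accuracy(np.vstack([media, goods]), domains)

# Extractor update (stand-in): remove each domain's mean offset -- the
# kind of alignment the adversarial objective pressures the extractor
# toward, i.e. features whose domain the discriminator cannot recover.
aligned = np.vstack([media - media.mean(axis=0), goods - goods.mean(axis=0)])
acc_after = disc_accuracy(aligned, domains)
print(round(acc_before, 2), round(acc_after, 2))  # high before, near 0.5 after
```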
Specific examples of the contrastive learning and adversarial learning have been described in the foregoing embodiments of the training method, and are not repeated here.
The model training device provided by the embodiment of the invention can align the feature vectors extracted from different content domains by training the feature extraction model through contrastive learning, or contrastive learning combined with adversarial learning, thereby enabling cross-domain feature comparison and feature fusion and effectively improving matching in cross-domain content retrieval and recommendation.
The foregoing embodiments describe the process flows of the model training method, the commodity mounting processing method, the retrieval method, and the recommendation method, together with the corresponding device structures. The functions of the above methods and devices can be implemented by an electronic device. As shown in fig. 16, which is a schematic structural diagram of the electronic device according to an embodiment of the present invention, it specifically includes: a memory 110 and a processor 120.
And a memory 110 for storing a program.
In addition to the programs described above, the memory 110 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 110 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 120, coupled to the memory 110, for executing the program in the memory 110 to perform the operational steps of any one or more of the methods described in the previous embodiments.
Furthermore, the processor 120 may also include various modules described in the foregoing embodiments to perform the processing of any one or more of the methods described in the foregoing embodiments, and the memory 110 may be used, for example, to store data required by the modules to perform the operations and/or output data.
The detailed description of the above processing procedure, the detailed description of the technical principle, and the detailed analysis of the technical effect are described in the foregoing embodiments, and are not repeated herein.
Further, as shown in the figure, the electronic device may further include: communication components 130, power components 140, audio components 150, a display 160, and other components. Only some of the components are schematically shown in the figure; this does not mean that the electronic device includes only the components shown.
The communication component 130 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, a mobile communication network, such as 2G, 3G, 4G/LTE, 5G, or a combination thereof. In an exemplary embodiment, the communication component 130 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 130 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component 140 provides power to the various components of the electronic device. The power components 140 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.
The audio component 150 is configured to output and/or input audio signals. For example, the audio component 150 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 110 or transmitted via the communication component 130. In some embodiments, audio assembly 150 also includes a speaker for outputting audio signals.
The display 160 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
Furthermore, an embodiment of the present invention further provides a computer program product, which includes a computer program or instructions, and when the computer program or instructions are executed by a processor, the processor is caused to implement the foregoing method.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The aforementioned program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (29)

1. A training method of a feature extraction model comprises the following steps:
obtaining samples related to the content in a plurality of content domains to form a plurality of cross-domain sample combinations;
using the trained feature extraction model to perform feature extraction on original data of the samples in the cross-domain sample combination to generate feature vectors corresponding to the samples;
and performing contrastive learning training on the feature extraction model, with the training target of reducing the distance between feature vectors corresponding to samples in the same cross-domain sample combination and enlarging the distance between feature vectors corresponding to samples in different cross-domain sample combinations.
2. The method of claim 1, further comprising:
and performing adversarial learning training on the feature extraction model, with the training target of reducing the accuracy with which a discriminator of the adversarial learning identifies the content domain to which a feature vector belongs.
3. The method according to claim 2, wherein in the adversarial learning training, the following training processes are performed alternately:
fixing the model parameters of the discriminator, and training the feature extraction model to reduce the discrimination accuracy of the discriminator on the content domain to which the feature vector belongs;
and fixing the model parameters of the feature extraction model, and training the discriminator to improve the discrimination accuracy of the discriminator on the content domain to which the feature vector belongs.
4. The method of claim 1, wherein the raw data comprises data of a plurality of modalities, the performing feature extraction on the raw data of the samples in the cross-domain sample combination, and the generating the feature vectors corresponding to the samples comprises:
and respectively extracting the features of the original data of each mode of the samples in the cross-domain sample combination to generate a plurality of mode feature vectors corresponding to the samples.
5. The method of claim 1, wherein the raw data of the sample comprises image data and/or text data,
the performing feature extraction on the original data of the samples in the cross-domain sample combination, and generating a feature vector corresponding to the samples includes:
and respectively extracting the features of the image data and/or the text data of the samples in the cross-domain sample combination to generate image feature vectors and/or text feature vectors corresponding to the samples.
6. The method of claim 4, wherein the trained feature extraction model employs a pre-trained feature extraction model with cross-modal feature extraction capability.
7. A commodity mounting processing method comprises the following steps:
acquiring multimedia data, performing cross-domain feature extraction on the multimedia data, and generating a multimedia feature vector corresponding to the multimedia data, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
acquiring a plurality of candidate commodity data, and performing cross-domain feature extraction on the candidate commodity data to generate a plurality of commodity feature vectors;
and determining commodity data for mounting from the candidate commodity data according to the correlation degree between the plurality of commodity feature vectors and the multimedia feature vector.
8. The method of claim 7, wherein obtaining a plurality of candidate good data comprises:
and determining the candidate commodity data in response to the user selecting an operation in the commodity database.
9. The method of claim 7, wherein the merchandise feature vectors and/or multimedia feature vectors comprise image feature vectors extracted based on image data and/or text feature vectors extracted based on text data.
10. A commodity mounting processing method comprises the following steps:
acquiring multimedia data, performing cross-domain feature extraction on the multimedia data, and generating a multimedia feature vector corresponding to the multimedia data, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
retrieving commodity data according to the multimedia feature vectors, and acquiring commodity data corresponding to the commodity feature vectors with the correlation degree of the multimedia feature vectors larger than a preset threshold value;
and mounting the commodity data on the multimedia data.
11. The method of claim 10, wherein the retrieving commodity data according to the multimedia feature vector, and acquiring commodity data corresponding to commodity feature vectors whose correlation degree with the multimedia feature vector is greater than a preset threshold value comprises:
according to the multimedia feature vector, searching in a commodity feature vector database to obtain a commodity feature vector of which the correlation degree with the multimedia feature vector is greater than a preset threshold value;
and acquiring corresponding commodity data according to the commodity feature vector.
12. A recommendation processing method, comprising:
acquiring historical multimedia data and/or historical commodity data accessed by a user in a historical manner;
performing cross-domain feature extraction on the historical multimedia data and/or the historical commodity data to generate corresponding historical multimedia feature vectors and/or historical commodity feature vectors, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain mode;
and retrieving multimedia data and/or commodity data according to the historical multimedia feature vector and/or the historical commodity feature vector, and acquiring multimedia data and/or commodity data whose multimedia feature vectors and/or commodity feature vectors have a correlation degree with the historical multimedia feature vector and/or the historical commodity feature vector greater than a preset threshold value, as recommendation data.
13. The method according to claim 12, wherein retrieving multimedia data and/or commodity data according to the historical multimedia feature vector and/or the historical commodity feature vector, and acquiring multimedia data and/or commodity data whose correlation degree with the historical multimedia feature vector and/or the historical commodity feature vector is greater than a preset threshold value, as recommendation data, comprises:
according to the historical multimedia feature vector and/or the historical commodity feature vector, searching in a database in which the multimedia feature vector and/or the commodity feature vector are stored, and acquiring the multimedia feature vector and/or the commodity feature vector of which the correlation degree with the historical multimedia feature vector and/or the historical commodity feature vector is greater than a preset threshold value;
and determining corresponding multimedia data and/or commodity data as the recommendation data according to the acquired multimedia feature vector and/or commodity feature vector.
14. The method of claim 12, wherein there are a plurality of the historical multimedia data and/or the historical merchandise data, and a plurality of the historical multimedia feature vectors and/or the historical merchandise feature vectors, the method further comprising: fusing the plurality of historical multimedia feature vectors and/or the historical commodity feature vectors to generate fused feature vectors,
the retrieving of multimedia data and/or commodity data according to the historical multimedia feature vector and/or the historical commodity feature vector, and the acquiring of multimedia data and/or commodity data whose feature vectors have a correlation degree greater than a preset threshold value comprises:
and retrieving multimedia data and/or commodity data according to the fusion feature vector, and acquiring the multimedia data and/or commodity data of which the correlation degree with the fusion feature vector is greater than a preset threshold value.
15. A search processing method, comprising:
generating a query vector according to retrieval information input by a user;
according to the query vector, querying a commodity feature vector database and/or a multimedia feature vector database, and acquiring a commodity feature vector and/or a multimedia feature vector of which the first correlation degree with the query vector is greater than a preset first threshold, wherein the commodity feature vector and/or the multimedia feature vector in the commodity feature vector database and/or the multimedia feature vector database are acquired based on cross-domain feature extraction, and the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
and returning corresponding commodity data and/or multimedia data to the user as a retrieval result according to the obtained commodity feature vector and/or multimedia feature vector.
16. The method of claim 15, wherein in case of returning multimedia data to a user, further comprising:
acquiring commodity feature vectors corresponding to commodity data mounted in the multimedia data;
and acquiring a second correlation degree between the multimedia feature vector corresponding to the multimedia data and the commodity feature vector, if the second correlation degree is greater than or equal to a preset second threshold value, keeping the mounted commodity data in a retrieval result returned to the user, and otherwise, deleting the mounted commodity data.
17. The method of claim 15, wherein in case of returning multimedia data to a user, further comprising:
according to the multimedia feature vector corresponding to the multimedia data, inquiring in a commodity feature vector database to obtain a commodity feature vector of which the correlation degree with the multimedia feature vector is greater than a preset third threshold value;
and acquiring corresponding commodity data according to the commodity feature vector, and mounting the commodity data into the multimedia data as a retrieval result for providing.
18. A training apparatus for a feature extraction model, comprising:
the system comprises a sample acquisition module, a cross-domain sample combination module and a content analysis module, wherein the sample acquisition module is used for acquiring samples related to content in a plurality of content domains to form a plurality of cross-domain sample combinations;
the characteristic vector generation module is used for extracting the characteristics of the original data of the samples in the cross-domain sample combination by using the trained characteristic extraction model to generate the characteristic vector corresponding to the samples;
and the training module is used for performing contrastive learning training on the feature extraction model, with the training target of reducing the distance between feature vectors corresponding to samples in the same cross-domain sample combination and enlarging the distance between feature vectors corresponding to samples in different cross-domain sample combinations.
19. The apparatus of claim 18, wherein the training module is further configured to: perform adversarial learning training on the feature extraction model, with the training target of reducing the accuracy with which a discriminator of the adversarial learning identifies the content domain to which a feature vector belongs.
20. A commodity mounting processing device includes:
the multimedia feature extraction module is used for acquiring multimedia data, performing cross-domain feature extraction on the multimedia data and generating a multimedia feature vector corresponding to the multimedia data, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
the commodity feature extraction module is used for acquiring a plurality of candidate commodity data, performing cross-domain feature extraction on the candidate commodity data and generating a plurality of commodity feature vectors;
and the relevancy processing module is used for determining commodity data for mounting from the candidate commodity data according to the relevancy between the commodity feature vectors and the multimedia feature vector.
21. A commodity mounting processing device, comprising:
the multimedia feature extraction module is used for acquiring multimedia data, performing cross-domain feature extraction on the multimedia data and generating a multimedia feature vector corresponding to the multimedia data, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
the commodity retrieval processing module is used for retrieving commodity data according to the multimedia feature vector, and acquiring commodity data corresponding to commodity feature vectors whose correlation degree with the multimedia feature vector is greater than a preset threshold value;
and the mounting processing module is used for mounting the commodity data on the multimedia data.
22. A recommendation processing apparatus comprising:
the historical data acquisition module is used for acquiring historical multimedia data and/or historical commodity data accessed by a user in a historical manner;
the historical data feature extraction module is used for performing cross-domain feature extraction on the historical multimedia data and/or the historical commodity data to generate corresponding historical multimedia feature vectors and/or historical commodity feature vectors, wherein the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain mode;
and the data retrieval processing module is used for retrieving multimedia data and/or commodity data according to the historical multimedia feature vector and/or the historical commodity feature vector, and acquiring multimedia data and/or commodity data whose multimedia feature vectors and/or commodity feature vectors have a correlation degree with the historical multimedia feature vector and/or the historical commodity feature vector greater than a preset threshold value, as recommendation data.
23. A search processing apparatus comprising:
the query vector generation module is used for generating a query vector according to the retrieval information input by the user;
the query vector retrieval module is used for querying in a commodity feature vector database and/or a multimedia feature vector database according to the query vector to obtain a commodity feature vector and/or a multimedia feature vector of which the first correlation degree with the query vector is greater than a preset first threshold, wherein the commodity feature vector and/or the multimedia feature vector in the commodity feature vector database and/or the multimedia feature vector database are obtained based on cross-domain feature extraction, and the cross-domain feature extraction enables the generated feature vectors to be aligned in a cross-domain manner;
and the retrieval result feedback module is used for returning corresponding commodity data and/or multimedia data to the user as a retrieval result according to the obtained commodity characteristic vector and/or multimedia characteristic vector.
24. A commodity mounting processing method comprises the following steps:
acquiring multimedia data uploaded by a user;
selecting matched commodity data from a commodity database according to content features obtained by performing cross-domain feature extraction on the multimedia data, and recommending the commodity data to the user;
and responding to the selection of the commodity data of the user, and mounting the selected commodity data into the multimedia data.
25. The method of claim 24, further comprising:
obtaining comment data after the multimedia data are published;
and selecting matched commodities from the commodity database according to the features obtained by performing cross-domain feature extraction on the comment data, and recommending the matched commodities to the user for updating the commodities mounted in the multimedia data.
26. A commodity mounting processing method comprises the following steps:
acquiring multimedia data uploaded by a user and a plurality of commodity data to be mounted;
acquiring the correlation between the commodity data and the multimedia data according to the content characteristics obtained by performing cross-domain characteristic extraction on the multimedia data and the commodity characteristics obtained by performing cross-domain characteristic extraction on the commodity data;
and recommending commodity data from a plurality of commodity data to be mounted according to the correlation.
27. The method of claim 26, wherein recommending, based on the relevance, product data from among a plurality of product data to be mounted comprises:
and ranking the relevance of the plurality of commodity data and the multimedia data, and recommending the commodity data based on the ranking.
28. The method of claim 26, further comprising:
obtaining comment data after the multimedia data are published;
and updating the commodity recommendation to the user according to the features obtained by performing cross-domain feature extraction on the comment data.
29. An electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to execute the training method of the feature extraction model according to any one of claims 1 to 6, and/or the processing method of the commodity mounting according to any one of claims 7 to 11 and 24 to 28, and/or the recommendation processing method according to any one of claims 12 to 14, and/or the retrieval processing method according to any one of claims 15 to 17.
CN202110328304.0A 2021-03-26 2021-03-26 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment Pending CN113420166A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110328304.0A CN113420166A (en) 2021-03-26 2021-03-26 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN113420166A true CN113420166A (en) 2021-09-21

Family

ID=77711877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110328304.0A Pending CN113420166A (en) 2021-03-26 2021-03-26 Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113420166A (en)


Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766462A * 2018-06-21 2018-11-06 浙江中点人工智能科技有限公司 Speech signal feature learning method based on the first-order derivative of the Mel spectrum
CN109213876A * 2018-08-02 2019-01-15 宁夏大学 Cross-modal retrieval method based on generative adversarial networks
CN109299341A * 2018-10-29 2019-02-01 山东师范大学 Adversarial cross-modal retrieval method and system based on dictionary learning
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
CN110795613A (en) * 2018-07-17 2020-02-14 阿里巴巴集团控股有限公司 Commodity searching method, device and system and electronic equipment
CN110990595A (en) * 2019-12-04 2020-04-10 成都考拉悠然科技有限公司 Zero sample cross-mode retrieval method for cross-domain alignment embedding space
US20200251100A1 (en) * 2019-02-01 2020-08-06 International Business Machines Corporation Cross-domain multi-task learning for text classification
CN111598712A (en) * 2020-05-18 2020-08-28 北京邮电大学 Training and searching method for data feature generator in social media cross-modal search
CN111625667A (en) * 2020-05-18 2020-09-04 北京工商大学 Three-dimensional model cross-domain retrieval method and system based on complex background image
WO2020224097A1 (en) * 2019-05-06 2020-11-12 平安科技(深圳)有限公司 Intelligent semantic document recommendation method and device, and computer-readable storage medium
CN111931062A (en) * 2020-08-28 2020-11-13 腾讯科技(深圳)有限公司 Training method and related device of information recommendation model
CN112528644A (en) * 2020-12-24 2021-03-19 北京百度网讯科技有限公司 Entity mounting method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU HUAN; ZHENG QINGHUA; LUO MINNAN; ZHAO HONGKE; XIAO YANG; LYU YANZHANG: "Zero-shot classification based on cross-domain adversarial learning", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, no. 12 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435984A (en) * 2021-08-27 2021-09-24 苏州浪潮智能科技有限公司 Cross-domain recommendation method, system, storage medium and equipment
CN114358821A (en) * 2021-12-27 2022-04-15 创优数字科技(广东)有限公司 Commodity detail feature extraction method and device, computer equipment and storage medium
CN116562359A (en) * 2023-07-10 2023-08-08 深圳须弥云图空间科技有限公司 CTR prediction model training method and device based on contrast learning and electronic equipment
CN116562359B (en) * 2023-07-10 2023-11-10 深圳须弥云图空间科技有限公司 CTR prediction model training method and device based on contrast learning and electronic equipment

Similar Documents

Publication Publication Date Title
US9892109B2 (en) Automatically coding fact check results in a web page
CN111061946B (en) Method, device, electronic equipment and storage medium for recommending scenerized content
US20190392330A1 (en) System and method for generating aspect-enhanced explainable description-based recommendations
US10180979B2 (en) System and method for generating suggestions by a search engine in response to search queries
CN112074857A (en) Combining machine learning and social data to generate personalized recommendations
CN113420166A (en) Commodity mounting, retrieving, recommending and training processing method and device and electronic equipment
CN109033149B (en) Information recommendation method and device, server and storage medium
CN108595493B (en) Media content pushing method and device, storage medium and electronic device
US11797634B2 (en) System and method for providing a content item based on computer vision processing of images
CN105635824A (en) Personalized channel recommendation method and system
CN112364204A (en) Video searching method and device, computer equipment and storage medium
US11366963B1 (en) Systems and methods for using machine learning models to organize and select modular components for user interface templates
CN110598084A (en) Object sorting method, commodity sorting device and electronic equipment
CN116186197A (en) Topic recommendation method, device, electronic equipment and storage medium
CN114817692A (en) Method, device and equipment for determining recommended object and computer storage medium
CN114357301A (en) Data processing method, device and readable storage medium
CN114090848A (en) Data recommendation and classification method, feature fusion model and electronic equipment
CN113641855A (en) Video recommendation method, device, equipment and storage medium
CN111506754A (en) Picture retrieval method and device, storage medium and processor
Hong et al. A novel semantic tagging technique exploiting wikipedia-based associated words
US20240098338A1 (en) Attribute-based content recommendations including movie recommendations based on metadata
US20240152512A1 (en) Machine learning for dynamic information retrieval in a cold start setting
JP2024055796A (en) A method for providing product-related information to a user device for purchasing a product
CN116975322A (en) Media data display method and device, computer equipment and storage medium
Aramuthakannan et al. Movie recommendation system using taymon optimized deep learning network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240307

Address after: Singapore

Applicant after: Alibaba Innovation Co.

Country or region after: Singapore

Address before: Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore

Applicant before: Alibaba Singapore Holdings Ltd.

Country or region before: Singapore