CN117216374A - Content recommendation method, content recommendation device, computer readable storage medium and computer equipment

Info

Publication number: CN117216374A
Application number: CN202310375092.0A
Authority: CN
Inventors: 陈禹昕; 祁仲昂; 张子琦; 骆颖民; 单瀛; 原春锋; 胡卫明
Original assignee: Tencent Technology Shenzhen Co Ltd; Institute of Automation of Chinese Academy of Science
Current assignee: Tencent Technology Shenzhen Co Ltd; Institute of Automation of Chinese Academy of Science
Priority date: 2023-03-29
Filing date: 2023-03-29
Publication date: 2023-12-12

Abstract

The embodiments of the application disclose a content recommendation method, a content recommendation device, a computer readable storage medium and computer equipment. The method includes: extracting first visual features from an image sample through a preset content recommendation model, and extracting query text word features from a query text sample; calculating the similarity between the image sample and the query text sample based on the first visual features and the query text word features, and determining first loss information based on the similarity; obtaining a defect text sample, extracting defect text word features from the defect text sample, and obtaining second visual features of the image sample; predicting a defect word according to the second visual features and the defect text word features, and predicting a correction word for the defect word; determining second loss information according to the defect word, the correction word and the query text sample corresponding to the defect text sample; performing convergence processing on the preset content recommendation model based on the first loss information and the second loss information; and performing content recommendation processing on a query text through the trained content recommendation model. The accuracy of content recommendation is thereby improved.

Description

Content recommendation method, content recommendation device, computer readable storage medium and computer equipment

Technical Field

The present application relates to the field of internet technologies, and in particular, to a content recommendation method, a content recommendation device, a computer readable storage medium, and a computer device.

Background

With the rapid development of internet technology, a large amount of information, such as various videos, is generated on the network. However, the sheer number of videos makes it difficult for a user to quickly find the videos that are actually required, which reduces the efficiency of video pushing. In order to accurately push the videos a user requires, most existing content recommendation methods use a deep learning model to calculate the similarity between the visual features of candidate images and videos and the text features of a query text, and then screen out the content matching the query text from the candidate content according to the similarity and push it to the user.

In the course of research on and practice with the prior art, it was found that existing content recommendation methods, which select the content matching the query text from the candidate content according to the similarity between the visual features of the candidate content and the text features, cannot accurately identify the fine-grained semantic association between an image and a text, so the accuracy of content recommendation is low.

Disclosure of Invention

The embodiment of the application provides a content recommendation method, a content recommendation device, a computer readable storage medium and computer equipment, which can improve the accuracy of content recommendation.

The embodiment of the application provides a content recommendation method, which comprises the following steps:

extracting at least one first visual feature from an image sample through a preset content recommendation model, and extracting query text word features from a query text sample corresponding to the image sample;

calculating the similarity between the image sample and the query text sample based on the first visual feature and the query text word feature, and determining first loss information corresponding to the preset content recommendation model based on the similarity;

obtaining a defect text sample corresponding to the query text sample, extracting defect text word features from the defect text sample, and obtaining at least one second visual feature corresponding to the image sample;

predicting a defect word which is not matched with the image sample in the defect text sample according to the second visual feature and the defect text word features, and predicting a correction word corresponding to the defect word;

determining second loss information corresponding to the preset content recommendation model according to the defect word, the correction word and the query text sample corresponding to the defect text sample;

based on the first loss information and the second loss information, performing convergence processing on the preset content recommendation model to obtain a trained content recommendation model;

and carrying out content recommendation processing on the query text through the trained content recommendation model.

Correspondingly, an embodiment of the present application provides a content recommendation device, including:

the extraction unit is used for extracting at least one first visual feature from the image sample through a preset content recommendation model, and extracting query text word features from a query text sample corresponding to the image sample;

the computing unit is used for computing the similarity between the image sample and the query text sample based on the first visual feature and the query text word features, and determining first loss information corresponding to the preset content recommendation model based on the similarity;

the acquisition unit is used for acquiring a defect text sample corresponding to the query text sample, extracting defect text word features from the defect text sample and acquiring at least one second visual feature corresponding to the image sample;

the prediction unit is used for predicting the defect word which is not matched with the image sample in the defect text sample according to the second visual feature and the defect text word features, and predicting the correction word corresponding to the defect word;

the determining unit is used for determining second loss information corresponding to the preset content recommendation model according to the defect word, the correction word and the query text sample corresponding to the defect text sample;

the convergence unit is used for carrying out convergence processing on the preset content recommendation model based on the first loss information and the second loss information to obtain a trained content recommendation model;

and the recommending unit is used for carrying out content recommending processing on the query text through the trained content recommending model.

In an embodiment, the second visual feature includes a global visual feature corresponding to the image sample, and the prediction unit includes:

the visual feature fusion subunit is used for carrying out visual feature fusion processing on the global visual features and the defect text word features to obtain the defect text word features after visual fusion;

a first target defect text word feature extraction subunit, configured to extract, from the visually fused defect text word features, a first target defect text word feature that includes a semantic association relationship between the image sample and the defect text sample;

and the first word prediction subunit is used for predicting the defect word which is not matched with the image sample in the defect text sample and the correction word corresponding to the defect word based on the first target defect text word features.

In an embodiment, the first target defect text word feature extraction subunit includes:

the first associated feature extraction module is used for extracting associated features of the visual fused defect text word features to obtain first associated features corresponding to each word feature in the visual fused defect text word features;

the first association weight determining module is used for determining a first association weight corresponding to each word feature in the visual fused defect text word features based on the first association features;

and the first fusion module is used for carrying out fusion processing on the first association features according to the first association weight to obtain first target defect text word features containing semantic association relations between the image samples and the defect text samples.

In an embodiment, the second visual feature includes a local visual feature corresponding to the image sample, and the prediction unit includes:

a target defect text word feature extraction subunit, configured to extract target defect text word features that include association relationships between each word in the query text sample from the defect text word features;

the feature fusion subunit is used for carrying out feature fusion processing on the local visual features and the target defect text word features to obtain second target defect text word features;

and the second word prediction subunit is used for predicting the defect word which is not matched with the image sample in the defect text sample and the correction word corresponding to the defect word according to the second target defect text word features.

In an embodiment, the feature fusion subunit comprises:

the second associated feature extraction module is used for extracting the associated features of the local visual features and the target defect text word features to obtain visual associated features corresponding to the local visual features and text associated features corresponding to the target defect text word features;

a second association weight determining module, configured to determine a second association weight corresponding to the visual association feature based on the visual association feature and the text association feature;

and the second fusion module is used for carrying out fusion processing on the visual association features based on the second association weight to obtain second target defect text word features.
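The feature fusion subunit described above can be illustrated with a minimal sketch of single-head cross attention, in which each target defect text word feature acts as a query over the local visual features. This is only an illustration under stated assumptions: the function names `softmax` and `cross_attention_fuse` are hypothetical, the association feature extraction is reduced to an identity mapping (a trained model would typically use learned projections), and the scaled dot product is assumed as the way the second association weight is computed.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(word_features: List[Vector],
                         local_visual: List[Vector]) -> List[Vector]:
    """Fuse local visual features into each word position.

    Assumption: identity projections stand in for the association
    feature extraction; weights come from a scaled dot product.
    """
    dim = len(local_visual[0])
    fused = []
    for word in word_features:
        # Association weight of every local visual feature for this word.
        scores = [sum(w * r for w, r in zip(word, region)) / math.sqrt(dim)
                  for region in local_visual]
        weights = softmax(scores)
        # Weighted sum of the visual features gives the fused word feature.
        fused.append([sum(wt * region[d] for wt, region in zip(weights, local_visual))
                      for d in range(dim)])
    return fused


# Two word features, each strongly aligned with one of two image regions.
word_features = [[10.0, 0.0], [0.0, 10.0]]
local_visual = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention_fuse(word_features, local_visual)
```

In this toy input each word attends almost entirely to the region it is aligned with, so each fused feature is close to that region's visual feature.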

In an embodiment, the computing unit comprises:

a history feature obtaining subunit, configured to obtain a momentum visual feature corresponding to the first visual feature and a momentum text word feature corresponding to the query text word feature;

the sample pair construction subunit is used for constructing a negative sample feature pair and a positive sample feature pair corresponding to the preset content recommendation model based on the first visual feature, the query text word feature, the momentum visual feature and the momentum text word feature;

and the similarity calculating subunit is used for calculating a first similarity between the negative sample feature pairs and a second similarity between the positive sample feature pairs, and determining the first similarity and the second similarity as the similarity between the image sample and the query text sample.

In an embodiment, the determining unit includes:

a difference word recognition subunit, configured to recognize a difference word in the query text sample compared to the defect text sample based on the query text sample and the defect word;

the first loss calculation subunit is used for calculating correctness prediction loss information corresponding to the preset content recommendation model according to the difference word and the correctness prediction probability corresponding to the defect word;

the second loss calculation subunit is used for determining correction prediction loss information corresponding to the preset content recommendation model based on the difference word and the correction prediction probability corresponding to the correction word;

and the loss determination subunit is used for determining second loss information corresponding to the preset content recommendation model according to the correctness prediction loss information and the correction prediction loss information.
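The determining unit described above can be illustrated with a minimal sketch. It rests on several assumptions that the text does not state: the defect text sample is taken to be produced by word replacement, so that it has the same length as the query text sample; the correctness prediction loss is taken to be a binary cross entropy; the correction prediction loss is taken to be a negative log likelihood; and the two are simply added. All function names are hypothetical.

```python
import math
from typing import Dict, List


def find_difference_positions(query_words: List[str], defect_words: List[str]) -> List[int]:
    """Positions where the query text differs from the defect text (same length assumed)."""
    return [i for i, (q, d) in enumerate(zip(query_words, defect_words)) if q != d]


def correctness_loss(defect_probs: List[float], diff_positions: List[int]) -> float:
    """Binary cross entropy: label 1 where the word was replaced, 0 elsewhere."""
    replaced = set(diff_positions)
    total = 0.0
    for i, prob in enumerate(defect_probs):
        prob = min(max(prob, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        label = 1.0 if i in replaced else 0.0
        total -= label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob)
    return total / len(defect_probs)


def correction_loss(correction_probs: List[Dict[str, float]],
                    query_words: List[str], diff_positions: List[int]) -> float:
    """Negative log likelihood of the original word at each replaced position."""
    if not diff_positions:
        return 0.0
    total = 0.0
    for i in diff_positions:
        total -= math.log(max(correction_probs[i].get(query_words[i], 0.0), 1e-12))
    return total / len(diff_positions)


def second_loss(defect_probs, correction_probs, query_words, defect_words) -> float:
    """Assumed combination: plain sum of the two loss terms."""
    diff = find_difference_positions(query_words, defect_words)
    return correctness_loss(defect_probs, diff) + correction_loss(correction_probs, query_words, diff)


query_words = ["a", "dog", "runs"]
defect_words = ["a", "cat", "runs"]           # "dog" was replaced by "cat"
defect_probs = [0.1, 0.9, 0.1]                # predicted probability that each word is a defect word
correction_probs = [{}, {"dog": 0.8, "cat": 0.2}, {}]
loss = second_loss(defect_probs, correction_probs, query_words, defect_words)
```

Here the difference word is "dog", the model is fairly confident that "cat" is the defect word, and it assigns probability 0.8 to the correct replacement, so both loss terms are small.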

In addition, the embodiment of the application also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in any content recommendation method provided by the embodiment of the application.

In addition, the embodiment of the application also provides a computer device, which comprises a processor and a memory, wherein the memory stores an application program, and the processor is used for running the application program in the memory to realize the content recommendation method provided by the embodiment of the application.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps in the content recommendation method provided by the embodiment of the application.

According to the embodiment of the application, at least one first visual feature is extracted from an image sample through a preset content recommendation model, and query text word features are extracted from a query text sample corresponding to the image sample; the similarity between the image sample and the query text sample is calculated based on the first visual feature and the query text word features, and first loss information corresponding to the preset content recommendation model is determined based on the similarity; a defect text sample corresponding to the query text sample is obtained, defect text word features are extracted from the defect text sample, and at least one second visual feature corresponding to the image sample is obtained; a defect word which is not matched with the image sample in the defect text sample is predicted according to the second visual feature and the defect text word features, and a correction word corresponding to the defect word is predicted; second loss information corresponding to the preset content recommendation model is determined according to the defect word, the correction word and the query text sample corresponding to the defect text sample; based on the first loss information and the second loss information, convergence processing is carried out on the preset content recommendation model to obtain a trained content recommendation model; and content recommendation processing is carried out on the query text through the trained content recommendation model.
In this scheme, the similarity between the image sample and the query text sample is calculated from the first visual features of the image sample and the query text word features of the query text sample, and the first loss information, which reflects the global semantic association between the image sample and the query text sample, is determined from that similarity. The defect words in the defect text sample that do not match the image sample, and the correction words corresponding to those defect words, are then predicted from the second visual features of the image sample and the defect text word features of the defect text sample, and the second loss information, which reflects the local semantic association between the image sample and the query text sample, is determined from the defect words, the correction words and the query text sample. The preset content recommendation model is converged based on the first loss information and the second loss information, so that it learns both the global and the local semantic associations between the image sample and the query text sample. When content recommendation is performed with the trained content recommendation model, the associations between local parts of the query text and multiple visual elements of the image can be identified, which improves how well the recommended content matches the query text and thus the accuracy of content recommendation.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic diagram of an implementation scenario of a content recommendation method according to an embodiment of the present application;

Fig. 2 is a flow chart of a content recommendation method according to an embodiment of the present application;

Fig. 3 is an overall flow diagram of a content recommendation method according to an embodiment of the present application;

Fig. 4 is another flow chart of a content recommendation method according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a content recommendation device according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.

The embodiment of the application provides a content recommendation method, a content recommendation device, a computer readable storage medium and computer equipment. The content recommendation device may be integrated in a computer device, which may be a server or a terminal.

The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) services, and big data and artificial intelligence platforms. Terminals may include, but are not limited to, cell phones, computers, intelligent voice interaction devices, intelligent appliances, vehicle terminals, aircraft, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.

Referring to fig. 1, which is a schematic diagram of an implementation scenario of the content recommendation method provided by an embodiment of the present application, the content recommendation device is taken to be integrated in a computer device, and the computer device may be a server or a terminal. The computer device may extract at least one first visual feature from an image sample through a preset content recommendation model, and extract query text word features from a query text sample corresponding to the image sample; calculate the similarity between the image sample and the query text sample based on the first visual feature and the query text word features, and determine first loss information corresponding to the preset content recommendation model based on the similarity; obtain a defect text sample corresponding to the query text sample, extract defect text word features from the defect text sample, and obtain at least one second visual feature corresponding to the image sample; predict a defect word which is not matched with the image sample in the defect text sample according to the second visual feature and the defect text word features, and predict a correction word corresponding to the defect word; determine second loss information corresponding to the preset content recommendation model according to the defect word, the correction word and the query text sample corresponding to the defect text sample; carry out convergence processing on the preset content recommendation model based on the first loss information and the second loss information to obtain a trained content recommendation model; and carry out content recommendation processing on the query text through the trained content recommendation model.

It should be noted that the embodiments of the present application may be applied to various scenarios, including, but not limited to, cloud technology, artificial intelligence, intelligent transportation, driving assistance, and the like. The schematic view of the implementation environment of the content recommendation method shown in fig. 1 is only an example, and the implementation environment of the content recommendation method described in the embodiment of the present application is for more clearly describing the technical solution of the embodiment of the present application, and does not constitute a limitation on the technical solution provided by the embodiment of the present application. As can be known by those skilled in the art, with the evolution of content recommendation and the appearance of new service scenarios, the technical scheme provided by the application is also applicable to similar technical problems.

The scheme provided by the embodiment of the application relates to artificial intelligence technologies such as computer vision, and is specifically described by the following embodiments. The order in which the following embodiments are described is not intended to limit the order of preference of the embodiments.

The present embodiment will be described from the point of view of a content recommendation apparatus, which may be integrated in a computer device, which may be a server, to which the present application is not limited.

Referring to fig. 2, fig. 2 is a flowchart illustrating a content recommendation method according to an embodiment of the application. The content recommendation method comprises the following steps:

In step 101, at least one first visual feature is extracted from an image sample by a preset content recommendation model, and a query text word feature is extracted from a query text sample corresponding to the image sample.

The preset content recommendation model may be an artificial intelligence model for content recommendation whose model parameters are preset and still need to be trained. After training is completed, the model can screen out, according to an input text, the best-matching content from the content to be recommended. The image sample may be a sample of the image modality used to train the preset content recommendation model. The first visual feature may be a visual feature of the image sample, that is, information characterizing the image sample. For example, the first visual feature may be a global visual feature of the image sample, which characterizes the image sample visually as a whole and may include feature information on its overall attributes, such as color features, texture features and shape features. The query text sample may be a sample of the text modality used to train the preset content recommendation model; it may be text information describing the image sample, and the query text word features may be information characterizing the query text sample. The image sample and the query text sample may form a sample pair with a correspondence relationship, so that the image sample can be retrieved using the text information in the query text sample.

There may be various ways of extracting at least one first visual feature from the image sample and extracting the query text word features from the query text sample corresponding to the image sample. For example, a visual encoder (Visual Encoder) in the preset content recommendation model may be used to extract at least one first visual feature from the image sample, and a text encoder (Text Encoder) in the preset content recommendation model may be used to extract the query text word features from the query text sample corresponding to the image sample.

For example, referring to fig. 3, which is an overall flow diagram of the content recommendation method provided by the embodiment of the present application, in the image-text contrastive learning module, a visual encoder in the preset content recommendation model may be used to extract at least one first visual feature from the image sample, and a text encoder in the preset content recommendation model may be used to extract the query text word features from the query text sample corresponding to the image sample. Specifically, the image sample v_i may be divided into a plurality of image blocks (image patch tokens), and an image classification feature may be introduced. The image classification feature may be the class token (cls_token, or CLS for short) corresponding to the image sample. It may be a sample-independent, learnable embedding vector that is randomly generated rather than derived from the image content, so that a bias toward any particular image block in the image sample can be avoided and the information of the image blocks can be better gathered to aggregate the global features of the image sample.

The image blocks and the image classification feature of the image sample may be input into the visual encoder, and feature extraction may be performed on them through N visual feature extraction layers in the visual encoder, so that the image classification feature output by the visual encoder can be used as the global visual feature, namely the first visual feature, of the image sample. Each visual feature extraction layer may include a self-attention network unit (Self Attention) and a multi-layer perceptron unit (Multilayer Perceptron, MLP). The self-attention network unit may identify the association relationships between the image blocks of the image sample and perform feature extraction on the image blocks, and the multi-layer perceptron unit may perform classification processing on the extracted features to obtain features containing global information of the image sample, so that the first visual feature is obtained after feature extraction by the N visual feature extraction layers.

For the query text sample t_i, word segmentation may be performed on the query text sample, and word embedding may be performed on each word in the query text sample to obtain a word embedding sequence of the query text sample. A text classification feature may also be introduced, which may be the classification feature corresponding to the query text sample; similarly, the text classification feature can better fuse the information of each word in the query text sample for global aggregation. The word embedding sequence and the text classification feature corresponding to the query text sample may then be input into the text encoder, and feature extraction may be performed on them through the M_1 and M_2 text feature extraction layers in the text encoder, so that the text classification feature output by the text encoder can be used as the query text word feature of the query text sample. The M_1 text feature extraction layers may be the layers of the text encoder that contain cross-attention network units (Cross Attention), and in the image-text contrastive learning module the cross-attention network units may not be activated. When the cross-attention network unit is not activated, a text feature extraction layer may include a self-attention network unit and a multi-layer perceptron unit. The self-attention network unit may identify the contextual semantic associations between the words in the query text sample and perform feature extraction on the word embedding sequence, and the multi-layer perceptron unit may classify the extracted features to obtain features containing global information of the query text sample, so that the query text word features of the query text sample are obtained after feature extraction by the multiple text feature extraction layers.
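The construction of the visual encoder input described above, image blocks plus a classification feature placed in front of them, can be sketched as follows. The sketch is illustrative only: the helper names `split_into_patches` and `build_token_sequence` are hypothetical, a single-channel image and raw pixel values as patch tokens are assumed for brevity (a real encoder would embed each block with a learned projection), and the classification token is a fixed vector here rather than a learned one.

```python
from typing import List

Matrix = List[List[float]]


def split_into_patches(image: Matrix, patch: int) -> List[List[float]]:
    """Split an H x W single-channel image into flattened, non-overlapping patch x patch blocks."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    blocks = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            blocks.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return blocks


def build_token_sequence(image: Matrix, patch: int, cls_token: List[float]) -> List[List[float]]:
    """Prepend the sample-independent classification token to the image block tokens."""
    if len(cls_token) != patch * patch:
        raise ValueError("classification token must have the same size as a block token")
    return [list(cls_token)] + split_into_patches(image, patch)


# A 4 x 4 toy image split into four 2 x 2 blocks, plus one classification token.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
cls_token = [0.0, 0.0, 0.0, 0.0]
tokens = build_token_sequence(image, 2, cls_token)
```

The encoder output at position 0 of such a sequence is what the description uses as the global (first) visual feature; the same pattern applies to the text classification feature placed in front of the word embedding sequence.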

In step 102, a similarity between the image sample and the query text sample is calculated based on the first visual feature and the query text word feature, and first loss information corresponding to the preset content recommendation model is determined based on the similarity.

The similarity may be information representing the degree of similarity between the image sample and the query text sample, and the first loss information may be information representing the gap between the degree of similarity that the preset content recommendation model predicts for the image sample and the query text sample and the actual degree of similarity.

There may be a plurality of ways of calculating the similarity between the image sample and the query text sample based on the first visual feature and the query text word feature. For example, with continued reference to fig. 3, the first visual feature and the query text word feature may be mapped into the same shared semantic space, and the inner product of the mapped first visual feature and the mapped query text word feature in that space may be calculated to obtain the similarity between the image sample and the query text sample.
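The mapping into a shared semantic space followed by an inner product can be sketched as below. This is a minimal illustration under assumptions: the mapping is taken to be a linear projection with hypothetical weight matrices `w_visual` and `w_text`, and the mapped features are L2-normalized before the inner product, which is a common choice but is not stated in the description.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(feature: Vector, weight: Matrix) -> Vector:
    """Linear map into the shared space; weight has shape out_dim x in_dim."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (assumed step, not stated in the source)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity(visual_feature: Vector, text_feature: Vector,
               w_visual: Matrix, w_text: Matrix) -> float:
    """Inner product of the two features after mapping them into the shared space."""
    v = l2_normalize(project(visual_feature, w_visual))
    t = l2_normalize(project(text_feature, w_text))
    return sum(a * b for a, b in zip(v, t))


# A 3-dimensional visual feature and 2-dimensional text features share a 2-dimensional space.
w_visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_text = [[1.0, 0.0], [0.0, 1.0]]
sim_match = similarity([1.0, 2.0, 3.0], [2.0, 4.0], w_visual, w_text)
sim_mismatch = similarity([1.0, 2.0, 3.0], [2.0, -1.0], w_visual, w_text)
```

With normalization the similarity lies in [-1, 1]: the first pair maps to the same direction in the shared space, and the second pair maps to orthogonal directions.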

After calculating the similarity between the image sample and the query text sample based on the first visual feature and the query text word feature, first penalty information corresponding to the preset content recommendation model may be determined based on the similarity. The method for determining the first loss information corresponding to the preset content recommendation model based on the similarity may be various, for example, a first negative sample pair similarity between the image sample and the neighbor text sample may be calculated based on the first visual feature and the neighbor text word feature of the neighbor text sample corresponding to the query text sample, a first contrast loss information corresponding to the preset content recommendation model may be calculated based on the first negative sample pair similarity and the similarity between the image sample and the query text sample, a second negative sample pair similarity between the query text sample and the neighbor image sample may be calculated based on the query text word feature and the neighbor visual feature of the neighbor image sample corresponding to the image sample, a second contrast loss information corresponding to the preset content recommendation model may be calculated based on the second negative sample pair similarity and the similarity between the image sample and the query text sample, and the first contrast loss information and the second contrast loss information may be subjected to fusion processing to obtain the first loss information corresponding to the preset content recommendation model.

The neighbor text samples may be the other text samples in the same batch as the query text sample, and the neighbor text word features may be the query text word features corresponding to the neighbor text samples. The first negative sample pair similarity may be the similarity between the image sample and a neighbor text sample, and the first contrast loss information may be loss information for retrieving text according to an image. Similarly, the neighbor image samples may be the other image samples in the same batch as the image sample, and the neighbor visual features may be the visual features corresponding to the neighbor image samples. The second negative sample pair similarity may be the similarity between the query text sample and a neighbor image sample, and the second contrast loss information may be loss information for retrieving an image according to text.

Optionally, the calculation formula of the first contrast loss information may be expressed as

$$L_1 = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(v_i^{T} t_i/\tau\right)}{\sum_{j=1}^{B}\exp\left(v_i^{T} t_j/\tau\right)}$$

where $L_1$ denotes the first contrast loss information, $B$ denotes the batch size, $\Sigma$ denotes summation, $\log$ denotes the logarithmic function, and $\tau$ denotes a temperature hyperparameter. $v_i$ denotes the first visual feature, mapped into the shared semantic space, corresponding to the i-th image sample; $t_i$ denotes the query text word feature, mapped into the shared semantic space, corresponding to the i-th query text sample; and $t_j$ with $j \neq i$ denotes the query text word feature, mapped into the shared semantic space, corresponding to a neighbor text sample. $T$ denotes transposition, and $x^{T}y$ denotes the similarity between $x$ and $y$; that is, $\exp(v_i^{T} t_i/\tau)$ carries the similarity between the image sample and the query text sample, and $\exp(v_i^{T} t_j/\tau)$ carries the negative sample pair similarity between the image sample and a neighbor text sample.

Accordingly, the calculation formula of the second contrast loss information may be expressed as

$$L_2 = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(t_i^{T} v_i/\tau\right)}{\sum_{j=1}^{B}\exp\left(t_i^{T} v_j/\tau\right)}$$

where $L_2$ denotes the second contrast loss information, and $v_j$ with $j \neq i$ denotes the visual feature, mapped into the shared semantic space, corresponding to a neighbor image sample.

There may be a plurality of ways of fusing the first contrast loss information and the second contrast loss information. For example, the first contrast loss information and the second contrast loss information may be averaged to obtain the first loss information corresponding to the preset content recommendation model, which may be expressed as $L_{align} = (L_1 + L_2)/2$.
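As an illustration only, the two contrast losses and their average can be sketched with NumPy as follows. The function names, the temperature value 0.07, and the unit-normalisation of the features are assumptions made for this sketch and are not specified in the application.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def alignment_loss(v, t, tau=0.07):
    """v, t: (B, D) image and text features already mapped into the shared space."""
    sim = v @ t.T / tau                       # sim[i, j] = v_i^T t_j / tau
    diag = np.arange(v.shape[0])              # matched pairs sit on the diagonal
    l1 = -log_softmax(sim, axis=1)[diag, diag].mean()   # image -> text
    l2 = -log_softmax(sim, axis=0)[diag, diag].mean()   # text -> image
    return l1, l2, (l1 + l2) / 2

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
v /= np.linalg.norm(v, axis=1, keepdims=True)   # unit-normalising is an assumption
t = v.copy()                                     # perfectly matched toy pairs
l1, l2, l_align = alignment_loss(v, t)
```

In this toy setting the text features equal the image features, so the two directions give the same loss and the value is below log(B), the loss of a uniform guess.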

In an embodiment, a momentum contrast learning method may be used as the training target of the image-text contrast learning module, so as to increase the number of negative samples in the model training process and thereby improve the training effect of contrast learning. Accordingly, the step of calculating the similarity between the image sample and the query text sample based on the first visual feature and the query text word feature may comprise: acquiring a momentum visual feature corresponding to the first visual feature and a momentum text word feature corresponding to the query text word feature; constructing negative sample feature pairs and positive sample feature pairs corresponding to the preset content recommendation model based on the first visual feature, the query text word feature, the momentum visual feature and the momentum text word feature; calculating a first similarity between the features of each negative sample feature pair and a second similarity between the features of each positive sample feature pair; and determining the first similarity and the second similarity as the similarity between the image sample and the query text sample.

The momentum visual feature may be a visual feature generated by a momentum visual encoder. The momentum visual encoder may be a momentum encoder (Momentum Encoder) for feature encoding of an image, and its parameters may be obtained by performing a momentum update on the parameters of the visual encoder. The momentum text word feature may be a text word feature generated by a momentum text encoder. The momentum text encoder may be a momentum encoder for feature encoding of a text, and its parameters may be obtained by performing a momentum update on the parameters of the text encoder. Optionally, the momentum visual features and the momentum text word features obtained in historical iterations of the momentum visual encoder and the momentum text encoder may be stored in two queues respectively, and these queued features may serve as negative samples for contrast learning. A negative sample feature pair may be a feature pair that serves as a negative sample; it may pair the first visual feature with a momentum text word feature in the batch or in the queue, or pair the query text word feature with a momentum visual feature in the batch or in the queue. A positive sample feature pair may be a feature pair that serves as a positive sample; it may pair the first visual feature with the momentum text word feature corresponding to the query text sample, or pair the query text word feature corresponding to the query text sample with the momentum visual feature corresponding to the image sample. The first similarity may be the similarity between the features of a negative sample feature pair, and the second similarity may be the similarity between the features of a positive sample feature pair.
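A minimal sketch of the two mechanisms described above, namely the momentum update of encoder parameters and the queue of historical momentum features, might look as follows. The update rule shown is the common exponential moving average; the coefficient `m`, the dictionary-of-arrays parameter layout, and the queue length are assumptions for illustration, not values from the application.

```python
import numpy as np
from collections import deque

def momentum_update(online, momentum, m=0.995):
    # momentum <- m * momentum + (1 - m) * online, parameter by parameter.
    return {k: m * momentum[k] + (1.0 - m) * online[k] for k in online}

online = {"w": np.ones((2, 2))}       # toy parameters of the trained encoder
momentum = {"w": np.zeros((2, 2))}    # toy parameters of the momentum encoder
momentum = momentum_update(online, momentum, m=0.9)

# First-in-first-out queue of momentum features; the oldest rows are dropped
# automatically once maxlen is reached.
queue = deque(maxlen=4)
for step in range(3):
    batch_feats = np.full((2, 3), float(step))   # stand-in momentum features
    queue.extend(batch_feats)                    # one queue entry per sample
```

After three batches of two samples, the queue holds only the four most recent feature rows, which is what allows the number of negative samples to exceed the batch size at a bounded memory cost.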

Accordingly, in the momentum contrast learning method, the first loss information may include first momentum contrast loss information and second momentum contrast loss information. The first momentum contrast loss information may represent loss information for retrieving text according to an image, and its calculation formula may be expressed as:

$$L_{I2T} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(v_i^{T} t_i^{m}/\tau\right)}{\sum_{j=1}^{B+N_q}\exp\left(v_i^{T} t_j^{m}/\tau\right)}$$

where $L_{I2T}$ denotes the first momentum contrast loss information, $B$ denotes the batch size, $N_q$ denotes the size of the queue corresponding to the momentum encoder, $\Sigma$ denotes summation, $\log$ denotes the logarithmic function, and $\tau$ denotes a temperature hyperparameter. $v_i$ denotes the first visual feature, mapped into the shared semantic space, corresponding to the image sample, and $t_i$ denotes the query text word feature, mapped into the shared semantic space, corresponding to the query text sample. $i$ indexes the i-th sample in the batch, and $j$ indexes the j-th sample in the batch and the queue. $t_i^{m}$ denotes the momentum text word feature in the batch mapped into the shared semantic space, and $t_j^{m}$ denotes a momentum text word feature in the batch and the queue mapped into the shared semantic space. $T$ denotes transposition, and $x^{T}y$ denotes the similarity between $x$ and $y$; that is, $\exp(v_i^{T} t_i^{m}/\tau)$ carries the similarity between the image sample and the query text sample, namely the second similarity, and $\exp(v_i^{T} t_j^{m}/\tau)$ carries the similarity between the image sample and the text samples in the batch and the queue, namely the first similarity.

Accordingly, the second momentum contrast loss information may represent loss information for retrieving an image according to text, and its calculation formula may be expressed as:

$$L_{T2I} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(t_i^{T} v_i^{m}/\tau\right)}{\sum_{j=1}^{B+N_q}\exp\left(t_i^{T} v_j^{m}/\tau\right)}$$

where $L_{T2I}$ denotes the second momentum contrast loss information, $v_i^{m}$ denotes the visual feature, mapped into the shared semantic space, corresponding to the momentum visual feature output by the momentum visual encoder for the batch, and $v_j^{m}$ denotes a visual feature, mapped into the shared semantic space, corresponding to a momentum visual feature in the batch and in the queue of the momentum visual encoder.

In this way, the first momentum contrast loss information and the second momentum contrast loss information may be fused to obtain the first loss information; for example, the first loss information may be $L_{align} = (L_{I2T} + L_{T2I})/2$.
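The momentum contrast losses can be sketched in NumPy as below, under the assumption that the positive for sample i is the momentum feature at index i of the batch and that queued features are appended after the batch. The helper names, the temperature, and the use of copies as stand-ins for momentum-encoder outputs are all hypothetical.

```python
import numpy as np

def momentum_contrast_loss(feat, mom_batch, mom_queue, tau=0.07):
    """feat: (B, D) features from the trained encoder of one modality;
    mom_batch: (B, D) momentum features of the other modality for the same batch;
    mom_queue: (N_q, D) momentum features of the other modality from the queue."""
    keys = np.concatenate([mom_batch, mom_queue], axis=0)    # (B + N_q, D)
    logits = feat @ keys.T / tau                             # (B, B + N_q)
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(feat.shape[0])
    return -log_prob[idx, idx].mean()                        # positive at column i

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
B, N_q, D = 3, 5, 8
v, t = unit(rng.normal(size=(B, D))), unit(rng.normal(size=(B, D)))
v_m, t_m = v.copy(), t.copy()        # stand-ins for momentum-encoder outputs
queue_v, queue_t = unit(rng.normal(size=(N_q, D))), unit(rng.normal(size=(N_q, D)))

l_i2t = momentum_contrast_loss(v, t_m, queue_t)   # image retrieves text
l_t2i = momentum_contrast_loss(t, v_m, queue_v)   # text retrieves image
l_align = (l_i2t + l_t2i) / 2
```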

In step 103, a defective text sample corresponding to the query text sample is obtained, defective text word features are extracted from the defective text sample, and at least one second visual feature corresponding to the image sample is obtained.

The defect text sample may be a text sample that corresponds to the query text sample and has a defect such as a local error, for example, a text sample obtained by performing word transformation processing on the query text sample. The defect text word feature may be information representing the defect text sample. The second visual feature may be information representing the image sample; for example, it may be a global visual feature of the image sample, may be a local visual feature of the image sample, or may include both the global visual feature and the local visual feature of the image sample. The global visual feature may be information representing a global attribute of the image sample, and the local visual feature may be information representing a local attribute of the image sample.

There may be a plurality of ways of performing word transformation processing on the query text sample to obtain the defect text sample corresponding to the query text sample. For example, a pre-trained language model may be used to randomly mask words in the query text sample to obtain a masked text sample, feature extraction may be performed on the masked text sample to obtain masked text features, word prediction may be performed at the mask positions in the masked text sample according to the masked text features, and a plurality of defect text samples corresponding to the query text sample may be generated based on the predicted words.

The pre-trained language model may be BERT (Bidirectional Encoder Representations from Transformers), the masked text sample may be a text sample obtained by masking some of the words in the query text sample, and the masked text feature may be information that characterizes the masked text sample.

In one embodiment, with continued reference to fig. 3, a knowledge-based text editing module (Knowledge-based Text Editing) may be used to perform word transformation processing on the query text sample to obtain the defect text sample corresponding to the query text sample. For example, for the query text sample $t_i$ "a woman plays with a dog", the word "woman" in $t_i$ may be masked to obtain the masked text $t_i^{mask}$ "a [MASK] plays with a dog". The pre-trained language model may then be used to extract features from the masked text $t_i^{mask}$ to obtain masked text features, and mask recovery may be performed on the masked text features to generate plausible predicted words that do not match the original image. For example, the word "woman" may be replaced with a word such as "boy", "man" or "cat" according to the semantic scene. The replacement word is an incorrect word (Incorrect word token), and the other words in the defect text sample are correct words (Correct word token), so that the defect text sample corresponding to the query text sample can be obtained from the predicted words. In this way, a text sample with a local semantic error, namely the defect text sample, is obtained from the query text sample, and the preset content recommendation model can be trained with the defect text sample so that it learns the local semantic information of the query text sample. The model can therefore identify the semantic information of text at a fine granularity, retrieve content matching the semantics of the query text more accurately, and improve content recommendation accuracy.
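The mask-and-replace procedure can be illustrated with the following pure-Python sketch. The `CANDIDATES` table is a hypothetical stand-in for the word prediction of the pre-trained language model, and the function name and the returned label format are assumptions; no real masked language model is called here.

```python
import random

# Hypothetical stand-in for the masked-language-model prediction step:
# plausible substitutes for a masked word. Not taken from the application.
CANDIDATES = {"woman": ["boy", "man", "cat"], "dog": ["ball", "cat"]}

def make_defect_samples(text, rng, num_samples=2):
    tokens = text.split()
    editable = [i for i, w in enumerate(tokens) if w in CANDIDATES]
    samples = []
    for _ in range(num_samples):
        pos = rng.choice(editable)                           # position to mask
        masked = tokens[:pos] + ["[MASK]"] + tokens[pos + 1:]
        replacement = rng.choice(CANDIDATES[tokens[pos]])    # differs from the original word
        defect = [replacement if w == "[MASK]" else w for w in masked]
        labels = [int(i == pos) for i in range(len(tokens))] # 1 marks the incorrect word
        samples.append((" ".join(defect), labels, tokens[pos]))
    return samples

samples = make_defect_samples("a woman plays with a dog", random.Random(0))
```

Each returned tuple holds a defect text sample, per-word labels marking the incorrect word, and the original word at that position, which is the information later needed to supervise detection and correction.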

In step 104, according to the second visual feature and the defect text word feature, predicting the defect word in the defect text sample, which does not match with the image sample, and predicting the correction word corresponding to the defect word.

The defect word may be a word that is not matched with the image sample in the predicted defect text sample, and the correction word may be a word that is predicted by correcting the defect word with a preset content recommendation model, that is, a correct word that is predicted to be matched with the image sample according to the image information in the image sample.

There may be a plurality of ways of predicting, according to the second visual feature and the defect text word feature, the defect word in the defect text sample that does not match the image sample and the correction word corresponding to the defect word. For example, when the second visual feature includes the global visual feature corresponding to the image sample, this step may include: performing visual feature fusion processing on the global visual feature and the defect text word features to obtain visually fused defect text word features; extracting, from the visually fused defect text word features, first target defect text word features containing the semantic association relationship between the image sample and the defect text sample; and predicting, based on the first target defect text word features, the defect word in the defect text sample that does not match the image sample and the correction word corresponding to the defect word.

The visual fused defective text word feature may be a defective text word feature fused with a global visual feature, and the first target defective text word feature may be a text word feature that extracts a semantic association relationship between an image sample and a defective text sample from the visual fused defective text word feature.

There may be a plurality of ways of performing visual feature fusion processing on the global visual feature and the defect text word features to obtain the visually fused defect text word features. For example, with continued reference to fig. 3, in the visual-language error modeling module based on the global visual feature, the global visual feature may be added to each word feature in the defect text word features to obtain the visually fused defect text word features. For example, assume that the global visual feature is $h^{v}$ and the defect text word features are $(x_1, x_2, x_3)$, where $x_1$, $x_2$ and $x_3$ are the word features of the individual words. Adding the global visual feature to each word feature then gives the visually fused defect text word features $(x_1 + h^{v}, x_2 + h^{v}, x_3 + h^{v})$.
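The feature addition described above amounts to broadcasting one global vector over all word positions. A small NumPy illustration follows; the two-dimensional toy values and the variable names are made up for the example.

```python
import numpy as np

h_v = np.array([0.5, -1.0])             # toy global visual feature, shape (D,)
x = np.array([[1.0, 2.0],               # x1
              [3.0, 4.0],               # x2
              [5.0, 6.0]])              # x3: defect text word features, shape (3, D)

# Broadcasting adds the same global visual feature to every word feature.
fused = x + h_v
```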

After the global visual feature and the defect text word features are subjected to visual feature fusion processing, the first target defect text word features containing the semantic association relationship between the image sample and the defect text sample can be extracted from the visually fused defect text word features. There may be a plurality of ways of doing so. For example, associated feature extraction may be performed on the visually fused defect text word features to obtain a first association feature corresponding to each word feature, a first association weight corresponding to each word feature may be determined based on the first association features, and the first association features may be fused according to the first association weights to obtain the first target defect text word features containing the semantic association relationship between the image sample and the defect text sample.

The first association feature may be information representing an association relationship between each word feature and other word features in the visually fused defective text word features, and the first association weight may be information representing an importance degree of each word feature in the visually fused defective text word features, that is, information representing an association degree between each word feature and other word features in the visually fused defective text word features.

There may be a plurality of ways of performing associated feature extraction on the visually fused defect text word features to obtain the first association feature corresponding to each word feature. For example, with continued reference to fig. 3, a self-attention network unit (Self Attention) may be used to extract features from each word feature in the visually fused defect text word features. Specifically, each word feature may be converted into vectors in three spaces, namely a query vector (query, q), a key vector (key, k) and a value vector (value, v). The conversion may be understood as fusing each word feature in the visually fused defect text word features with a conversion parameter for each of the three spaces, and the resulting query vector, key vector and value vector are used as the first association features corresponding to that word feature.

After associated feature extraction is performed on the visually fused defect text word features, the first association weight corresponding to each word feature can be determined based on the first association features. There may be a plurality of ways of doing so. For example, a self-attention network unit may be used to compute the dot product of the query vector corresponding to each word feature in the visually fused defect text word features with the key vectors of the other word features, so as to obtain an attention score (score) corresponding to each word feature, and the first association weight corresponding to each word feature may then be calculated based on that attention score.

Besides the feature extraction of the visual fused defect text word features by adopting a self-attention network, other networks capable of capturing the association relation between each word feature and other word features in the visual fused defect text word features can be adopted, so that the weight of each word feature in all word features is determined.

After the first association weight corresponding to each word feature in the visually fused defect text word features is determined based on the first association features, the first association features can be fused according to the first association weights to obtain the first target defect text word features containing the semantic association relationship between the image sample and the defect text sample. There may be a plurality of ways of performing this fusion. For example, the first association features may be weighted according to the first association weights and the weighted results may be summed, thereby obtaining the first target defect text word features.
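The three steps above (query/key/value conversion, attention scores turned into weights, weighted sum) correspond to ordinary single-head self-attention, which may be sketched as follows. The random projection matrices, the softmax normalisation, the scaling by the square root of the key dimension, and the inclusion of each word's own key are standard choices assumed here; the application does not fix them.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (L, D) visually fused defect text word features."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # first association features
    scores = q @ k.T / np.sqrt(k.shape[-1])      # attention scores (scaling assumed)
    weights = softmax(scores, axis=-1)           # first association weights
    return weights @ v, weights                  # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                      # three words, D = 4
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
target_feats, attn = self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` sums to one, so every output row is a convex combination of the value vectors of all words in the defect text sample.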

After the first target defect text word features containing the semantic association relationship between the image sample and the defect text sample are extracted from the visually fused defect text word features, the defect word in the defect text sample that does not match the image sample and the correction word corresponding to the defect word can be predicted based on the first target defect text word features. For example, with continued reference to fig. 3, in the visual-language error modeling module based on the global visual feature, a binary classification layer in the preset content recommendation model may be used to predict, based on the first target defect text word features output by the text encoder, the defect word in the defect text sample that does not match the image sample together with the probability that the defect word is an incorrect word in the defect text sample. A multi-class classification layer in the preset content recommendation model may then be used to predict a plurality of correction words corresponding to the defect word together with the probability of each correction word, that is, the probability that the correction word is the correct word in the query text sample corresponding to the defect word.
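One way to picture the two prediction heads is the sketch below: a sigmoid head scores each word position as incorrect or not, and a softmax head ranks candidate correction words for the flagged positions. The weight values, the tiny vocabulary, the 0.5 threshold and the function name are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def predict_defects(h, w_det, w_cor, vocab, threshold=0.5):
    """h: (L, D) target defect text word features, one row per word position."""
    p_defect = 1.0 / (1.0 + np.exp(-(h @ w_det)))   # binary head: P(word j is incorrect)
    p_cor = softmax(h @ w_cor, axis=-1)             # multi-class head over the vocabulary
    results = []
    for j in np.flatnonzero(p_defect > threshold):
        best = int(p_cor[j].argmax())
        results.append((int(j), float(p_defect[j]), vocab[best], float(p_cor[j, best])))
    return results

vocab = ["woman", "girl", "man"]                    # toy correction vocabulary
h = np.eye(3)                                       # toy features for three positions
w_det = np.array([-2.0, 3.0, -2.0])                 # only position 1 scores as incorrect
w_cor = np.array([[0.0, 0.0, 0.0],
                  [2.0, 1.0, 0.0],                  # position 1 prefers "woman"
                  [0.0, 0.0, 0.0]])
results = predict_defects(h, w_det, w_cor, vocab)
```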

When the second visual feature includes a local visual feature corresponding to the image sample, the step of predicting a defective word in the defective text sample that does not match the image sample and predicting a correction word corresponding to the defective word according to the second visual feature and the defective text word feature may include: extracting target defect text word characteristics containing the association relation between each word in the query text sample from the defect text word characteristics, carrying out characteristic fusion processing on the local visual characteristics and the target defect text word characteristics to obtain second target defect text word characteristics, and predicting defect words which are not matched with the image sample and correction words corresponding to the defect words in the defect text sample according to the second target defect text word characteristics.

The target defect text word feature may be a text word feature including an association relationship between each word in the query text sample, and the second target defect text word feature may be a text word feature fused with the local visual semantics in the image sample and the association information between the local text semantics in the query text sample. Alternatively, with continued reference to fig. 3, the local visual features may be the visual features output by N visual feature extraction layers in the visual encoder.

There may be a plurality of ways of extracting, from the defect text word features, the target defect text word features containing the association relationship between the words in the query text sample. For example, with continued reference to fig. 3, a self-attention network unit (Self Attention) may be used to extract these target defect text word features from the defect text word features.

After the target defect text word features containing the association relationship between the words in the query text sample are extracted from the defect text word features, feature fusion processing can be performed on the local visual features and the target defect text word features. There may be a plurality of ways of performing this feature fusion processing to obtain the second target defect text word features. For example, associated feature extraction may be performed on the local visual features and the target defect text word features to obtain visual association features corresponding to the local visual features and text association features corresponding to the target defect text word features, a second association weight corresponding to the visual association features may be determined based on the visual association features and the text association features, and the visual association features may be fused based on the second association weight to obtain the second target defect text word features.

There may be a plurality of ways of performing associated feature extraction on the local visual features and the target defect text word features to obtain the visual association features and the text association features. For example, with continued reference to fig. 3, in the visual-language error modeling module based on the local visual features, a cross-attention network unit may be used to perform the extraction. Specifically, the word features in the target defect text word features may be converted into query vectors, and the local visual features of the image sample may be converted into key vectors and value vectors. The conversion may be understood as fusing the target defect text word features and the local visual features with conversion parameters of the corresponding dimensions. The resulting query vectors are used as the text association features corresponding to each word feature in the target defect text word features, and the key vectors and value vectors are used as the visual association features corresponding to each visual feature in the local visual features.

After associated feature extraction is performed on the local visual features and the target defect text word features, the second association weight corresponding to the visual association features can be determined based on the visual association features and the text association features. There may be a plurality of ways of doing so. For example, a cross-attention network may be used to compute the dot product of the query vector corresponding to each target defect text word feature with the key vectors of the local visual features, so as to obtain the attention score between each local visual feature and the corresponding target defect text word feature, and the second association weight of the visual association features may then be calculated based on the attention scores.

After the second association weight corresponding to the visual association feature is determined based on the visual association feature and the text association feature, fusion processing can be performed on the visual association feature based on the second association weight, and the second target defect text word feature is obtained. The method for performing fusion processing on the visual association features based on the second association weights can be various, for example, the visual association features can be weighted according to the second association weights, and the weighted results are summed to obtain the second target defect text word features.
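The cross-attention fusion described above can be sketched as follows, with queries taken from the word features and keys and values taken from the local visual features. As with the earlier sketches, the projection matrices, the softmax, the scaling factor and all names are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, patches, w_q, w_k, w_v):
    """text: (L, Dt) target defect text word features;
    patches: (P, Dv) local visual features of the image sample."""
    q = text @ w_q                                # text association features (queries)
    k, v = patches @ w_k, patches @ w_v           # visual association features (keys, values)
    scores = q @ k.T / np.sqrt(k.shape[-1])       # attention scores (scaling assumed)
    weights = softmax(scores, axis=-1)            # second association weights
    return weights @ v, weights                   # second target defect text word features

rng = np.random.default_rng(0)
text = rng.normal(size=(3, 4))                    # three words
patches = rng.normal(size=(5, 6))                 # five local visual features
w_q = rng.normal(size=(4, 4))
w_k, w_v = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
fused_text, cross_weights = cross_attention(text, patches, w_q, w_k, w_v)
```

Each word receives its own distribution over the local visual features, so a word such as "boy" can attend to the image regions that confirm or contradict it.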

There may be a plurality of ways of predicting, according to the second target defect text word features, the defect word in the defect text sample that does not match the image sample and the correction word corresponding to the defect word. For example, with continued reference to fig. 3, in the visual-language error modeling module based on the local visual features, a binary classification layer in the preset content recommendation model may be used to predict, based on the second target defect text word features output by the text encoder, the defect word in the defect text sample that does not match the image sample together with the probability that the defect word is an incorrect word in the defect text sample. A multi-class classification layer in the preset content recommendation model may then be used to predict the correction words corresponding to the defect word together with the probability of each correction word, that is, the probability that the correction word is the correct word in the query text sample corresponding to the defect word.

In step 105, second loss information corresponding to the preset content recommendation model is determined according to the defect word, the correction word and the query text sample corresponding to the defect text sample.

The second loss information may be information representing the difference between the defect words that the preset content recommendation model predicts in the defect text sample and the actual defect words, together with the difference between the correction words that the model predicts for the defect words and the actual correct words.

There may be a plurality of ways of determining the second loss information corresponding to the preset content recommendation model according to the defect word, the correction word and the query text sample corresponding to the defect text sample. For example, a difference word in the query text sample compared with the defect text sample may be identified based on the query text sample and the defect word; accuracy prediction loss information corresponding to the preset content recommendation model may be calculated according to the difference word and the accuracy prediction probability corresponding to the defect word; correction prediction loss information corresponding to the preset content recommendation model may be determined based on the difference word and the correction prediction probability corresponding to the correction word; and the second loss information corresponding to the preset content recommendation model may be determined according to the accuracy prediction loss information and the correction prediction loss information.

The difference word may be a word in the query text sample, which is different from the defect text sample, that is, a correct word (Correction) corresponding to the defect word in the defect text sample, the accuracy prediction probability may be a probability that the preset content recommendation model predicts that the defect word is a word in the defect text sample with errors, the accuracy prediction loss information may be information representing a difference between the defect word in the preset content recommendation model predicted to be in error and a real situation, and the Correction prediction probability may be a probability that the preset content recommendation model predicts that a Correction word corresponding to the defect word is a word for correcting the defect word. The correction prediction loss information can be information for representing a gap between correction words corresponding to the defect words in the preset content recommendation model and real conditions.

There may be a plurality of ways of identifying the difference word in the query text sample compared with the defect text sample based on the query text sample and the defect word. For example, assuming that the query text sample is "a woman plays with a dog" and the defect text sample is "a boy plays with a dog", it can be determined from the query text sample and the defect word that the defect word is "boy" and that the difference word in the query text sample compared with the defect text sample is "woman". Accordingly, the correction words that the preset content recommendation model predicts for the defect word may include, for example, "woman", "girl" and "man", and the model outputs a correction prediction probability for each correction word; for example, the correction prediction probability of "woman" may be 0.7, that of "girl" may be 0.25, and so on. The larger the correction prediction probability of the correction word "woman", the higher the correction prediction accuracy of the preset content recommendation model and the better the model performs.
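Identifying the difference word can be illustrated by a position-wise comparison of the two texts. This sketch assumes whitespace tokenisation and that the word transformation replaces words one for one, so both texts have the same length; the function name is hypothetical.

```python
def difference_words(query_text, defect_text):
    """Return (position, difference word, defect word) for every mismatch."""
    q, d = query_text.split(), defect_text.split()
    if len(q) != len(d):
        raise ValueError("one-for-one word replacement is assumed")
    return [(j, qw, dw) for j, (qw, dw) in enumerate(zip(q, d)) if qw != dw]

diffs = difference_words("a woman plays with a dog", "a boy plays with a dog")
# diffs == [(1, "woman", "boy")]
```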

There may be various manners of calculating the correctness prediction loss information corresponding to the preset content recommendation model according to the difference word and the correctness prediction probability corresponding to the defect word. For example, when the second visual feature is a global visual feature, a calculation formula of the correctness prediction loss information corresponding to the preset content recommendation model may be expressed as:

wherein $L_{det}(h^v)$ denotes the correctness prediction loss information when the second visual feature is a global visual feature, and $\mathbb{E}(\cdot)$ denotes the expectation of the loss. The remaining symbols of the formula respectively denote the probability distribution of detected defect words at the j-th position in the defect text sample, the defect text sample, and the probability of detecting the defect word at the j-th position in the defect text sample, namely, the correctness prediction probability.

Accordingly, when the second visual feature is a global visual feature, a calculation formula of the correction prediction loss information corresponding to the preset content recommendation model may be expressed as:

wherein $L_{cor}(h^v)$ denotes the correction prediction loss information when the second visual feature is a global visual feature. The remaining symbols of the formula respectively denote the probability distribution of predicted correction words at the j-th position in the defect text sample, and the probability of the predicted correction word at the j-th position in the defect text sample, namely, the correction prediction probability.

The method for determining the second loss information corresponding to the preset content recommendation model according to the correctness prediction loss information and the correction prediction loss information may be various, for example, the correctness prediction loss information and the correction prediction loss information may be accumulated to obtain the second loss information corresponding to the preset content recommendation model.
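As an illustration of how the two losses might be computed and accumulated, the following sketch uses cross-entropy for both terms. The cross-entropy form is an assumption made here, chosen because it is consistent with the probability definitions above; the function names, the toy three-word vocabulary and the probability values are likewise hypothetical.

```python
import numpy as np


def detection_loss(p_defect, is_defect):
    """Binary cross-entropy averaged over word positions (assumed loss form).

    p_defect[j]  : predicted probability that word j is a defect word.
    is_defect[j] : 1 if word j differs from the query text sample, else 0.
    """
    p = np.clip(np.asarray(p_defect, dtype=float), 1e-12, 1 - 1e-12)
    y = np.asarray(is_defect, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


def correction_loss(p_correction, target_ids, is_defect):
    """Cross-entropy of the true difference word, averaged over defect positions.

    p_correction[j, w] : predicted probability that vocabulary word w corrects position j.
    target_ids[j]      : vocabulary index of the query-text word at position j.
    Assumes at least one defect position exists.
    """
    p = np.clip(np.asarray(p_correction, dtype=float), 1e-12, 1.0)
    positions = np.flatnonzero(np.asarray(is_defect))
    picked = p[positions, np.asarray(target_ids)[positions]]
    return float(-np.mean(np.log(picked)))


# Toy vocabulary ["woman", "girl", "man"]; defect text "a boy plays with a dog".
is_defect = [0, 1, 0, 0, 0, 0]
p_defect = [0.1, 0.9, 0.1, 0.1, 0.1, 0.1]
p_correction = np.full((6, 3), 1.0 / 3.0)
p_correction[1] = [0.7, 0.25, 0.05]       # correction probabilities at the defect position
target_ids = [0, 0, 0, 0, 0, 0]           # only the defect position is actually used

l_det = detection_loss(p_defect, is_defect)
l_cor = correction_loss(p_correction, target_ids, is_defect)
second_loss = l_det + l_cor               # accumulation of the two losses
```

Under these assumptions, a higher correction prediction probability for "woman" lowers `l_cor`, which matches the statement that a larger probability indicates a more accurate model.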

In an embodiment, the second visual feature may include a global visual feature and a local visual feature of the image sample. In this case, the steps of predicting, according to the second visual feature and the defect text word feature, a defect word in the defect text sample that does not match the image sample and a correction word corresponding to the defect word, and of determining the second loss information corresponding to the preset content recommendation model according to the defect word, the correction word and the query text sample corresponding to the defect text sample, may include: predicting, according to the global visual feature and the defect text word feature, a first defect word in the defect text sample that does not match the image sample, and predicting a first correction word corresponding to the first defect word; determining first sub-loss information corresponding to the preset content recommendation model according to the first defect word, the first correction word and the query text sample corresponding to the defect text sample; predicting, according to the local visual feature and the defect text word feature, a second defect word in the defect text sample that does not match the image sample, and predicting a second correction word corresponding to the second defect word; determining second sub-loss information corresponding to the preset content recommendation model according to the second defect word, the second correction word and the query text sample corresponding to the defect text sample; and fusing the first sub-loss information and the second sub-loss information to obtain the second loss information corresponding to the preset content recommendation model.

With continued reference to fig. 3, the first defect word may be a defect word predicted in the vision-language error modeling module based on the global visual feature, the first correction word may be a correction word corresponding to the first defect word, the first sub-loss information may be loss information in the vision-language error modeling module based on the global visual feature, the second defect word may be a defect word predicted in the vision-language error modeling module based on the local visual feature, the second correction word may be a correction word corresponding to the second defect word, and the second sub-loss information may be loss information in the vision-language error modeling module based on the local visual feature.

Specifically, with continued reference to fig. 3, the correctness prediction loss information and the correction prediction loss information corresponding to the preset content recommendation model may be calculated based on the correctness prediction probability of the first defect word and the correction prediction probability of the corresponding first correction word output by the vision-language error modeling module based on the global visual feature, and these two pieces of loss information may be fused to obtain the first sub-loss information corresponding to the global visual feature. Meanwhile, the correctness prediction loss information and the correction prediction loss information corresponding to the preset content recommendation model may be calculated based on the correctness prediction probability of the second defect word and the correction prediction probability of the corresponding second correction word predicted by the vision-language error modeling module based on the local visual feature, and these two pieces of loss information may be fused to obtain the second sub-loss information corresponding to the local visual feature. The first sub-loss information and the second sub-loss information may then be fused, for example by adding them, to obtain the second loss information.

Therefore, the knowledge-based text editing module provided by the embodiment of the application can generate the defect text sample with the local text semantic error, so that the global visual characteristic and the local visual semantic characteristic of the image sample interact with the local semantic error of the defect text sample, the correlation between the global visual and text semantics and the local visual and text semantics can be learned by the preset content recommendation model, and further the cross-modal fine-granularity and multi-granularity semantic correlation between the image sample and the query text sample can be fully promoted by detecting and correcting the defect text sample, so that the content matched with the query text can be screened out more accurately in a cross-modal manner, and the content recommendation accuracy is improved.

In step 106, based on the first loss information and the second loss information, convergence processing is performed on the preset content recommendation model, so as to obtain a trained content recommendation model.

The trained content recommendation model may be the preset content recommendation model after training is completed, and may be used for content recommendation processing of query texts. Because the trained content recommendation model is obtained by training based on the first loss information and the second loss information, it learns the global semantic association and the local semantic association between the image and the query text. When content recommendation processing is performed according to the trained content recommendation model, the association between the local text semantics of the query text and the multi-granularity visual semantics of the image can be identified, so that content with a high degree of matching with the query text can be recommended and the accuracy of content recommendation is improved.

The method for performing convergence processing on the preset content recommendation model based on the first loss information and the second loss information may have multiple manners, for example, the first loss information and the second loss information may be accumulated, and then the accumulated result is averaged to obtain target loss information corresponding to the preset content recommendation model, so that the preset content recommendation model may be subjected to convergence processing according to the target loss information to obtain the trained content recommendation model.
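The accumulation and averaging described above can be written out as a short sketch. The function names and the numeric loss values are hypothetical, and the addition of the sub-losses follows the example manner of fusion given in this description rather than a prescribed formula.

```python
def second_loss(global_det, global_cor, local_det, local_cor):
    """Fuse the sub-losses of the global and local error modeling branches by addition."""
    first_sub_loss = global_det + global_cor    # branch using the global visual feature
    second_sub_loss = local_det + local_cor     # branch using the local visual feature
    return first_sub_loss + second_sub_loss


def target_loss(first_loss, second_loss_value):
    """Accumulate the first and second loss information, then average the result."""
    return (first_loss + second_loss_value) / 2.0


# Hypothetical loss values for one training step.
loss = target_loss(1.0, second_loss(0.2, 0.3, 0.1, 0.4))
```

The resulting target loss would then be passed to whichever optimizer is used to converge the preset content recommendation model.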

In step 107, content recommendation processing is performed on the query text through the trained content recommendation model.

The query text can be a text used for querying contents, information described in the query text can be identified through a trained content recommendation model, and target contents are screened out from the contents to be recommended according to the identified information for recommendation. The target content may be content matching the query text, and the content may be image, video, or the like. Therefore, the content recommendation processing is carried out on the query text through the trained content recommendation model, so that multi-granularity and fine-granularity semantic association information between the vision and the text modes can be extracted from the query text and the content to be recommended more accurately, the target content matched with the query text can be identified from the content to be recommended more accurately, and the content recommendation accuracy is further improved.

There may be various manners of performing content recommendation processing on the query text through the trained content recommendation model. For example, referring to fig. 3, the trained content recommendation model may be a model structure including only the image text contrast learning module. Feature extraction may be performed on the content to be recommended through the trained content recommendation model to obtain the visual features of the content to be recommended; the query text may be input into the content recommendation model, and feature extraction may be performed on the query text according to the content recommendation model to obtain the query text features; the visual features of the content to be recommended and the query text features may be mapped into a shared semantic space, and the similarity between the visual features and the query text features may be calculated in the shared semantic space; and at least one target content matching the query text may be screened from the content to be recommended according to the similarity between the visual features and the query text features, and content recommendation processing may be performed on the target content.

There may be various manners of screening at least one target content matching the query text from the content to be recommended according to the similarity between the visual features and the query text features. For example, the content to be recommended with the highest similarity may be determined as the target content; alternatively, the content to be recommended may be ranked according to the similarity, and the ranked content to be recommended may be determined as the target content.
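The screening step can be sketched as a similarity ranking in the shared semantic space. In this minimal sketch the features are assumed to be already mapped into that space, cosine similarity is assumed as the similarity measure, and the name `recommend` and the toy feature values are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def recommend(query_feature, content_features, top_k=1):
    """Rank candidate contents by cosine similarity to the query text feature.

    Returns the indices of the top_k most similar contents and all similarities.
    """
    q = l2_normalize(np.asarray(query_feature, dtype=float))
    c = l2_normalize(np.asarray(content_features, dtype=float))
    similarity = c @ q                      # one score per content to be recommended
    order = np.argsort(-similarity)         # descending similarity
    return order[:top_k].tolist(), similarity


# Three candidate contents and one query text, already in the shared space.
content = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([0.5, 0.9])
top, sims = recommend(query, content, top_k=2)
```

Setting `top_k=1` corresponds to taking only the content with the highest similarity, while a larger `top_k` corresponds to recommending the ranked contents.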

As can be seen from the above, in the embodiment of the present application, at least one first visual feature is extracted from the image sample by the preset content recommendation model, and the query text word feature is extracted from the query text sample corresponding to the image sample; calculating the similarity between the image sample and the query text sample based on the first visual feature and the query text word feature, and determining first loss information corresponding to the preset content recommendation model based on the similarity; obtaining a defect text sample corresponding to the query text sample, extracting defect text word characteristics from the defect text sample, and obtaining at least one second visual characteristic corresponding to the image sample; predicting a defect word which is not matched with the image sample in the defect text sample according to the second visual characteristic and the defect text word characteristic, and predicting a correction word corresponding to the defect word; determining second loss information corresponding to the preset content recommendation model according to the defect word, the correction word and the query text sample corresponding to the defect text sample; based on the first loss information and the second loss information, carrying out convergence processing on the preset content recommendation model to obtain a trained content recommendation model; and carrying out content recommendation processing on the query text through the trained content recommendation model. 
In this scheme, the similarity between the image sample and the query text sample is calculated from the first visual feature of the image sample and the query text word feature of the query text sample, so that first loss information, which reflects the global semantic association between the image sample and the query text sample, is determined for the preset content recommendation model according to the similarity. Then, the defect word in the defect text sample that does not match the image sample, and the correction word corresponding to the defect word, are predicted according to the second visual feature of the image sample and the defect text word feature of the defect text sample, so that second loss information, which reflects the local semantic association between the image sample and the query text sample, is determined for the preset content recommendation model according to the defect word, the correction word and the query text sample. The preset content recommendation model is converged based on the first loss information and the second loss information, so that it can learn the global semantic association and the local semantic association between the image sample and the query text sample. When content recommendation processing is performed according to the trained content recommendation model, the association between the local text semantics of the query text and the multi-granularity visual semantics of the image can be identified, so that content with a high degree of matching with the query text can be recommended and the accuracy of content recommendation is improved.

According to the method described in the above embodiments, examples are described in further detail below.

In this embodiment, a description will be given of an example in which the content recommendation apparatus is specifically integrated in a computer device. The content recommendation method uses a server as an execution subject, and uses the second visual feature including global visual features and local visual features as an example to specifically describe the content recommendation method.

To better describe the embodiment of the present application, please refer to fig. 4, which is another flow chart of the content recommendation method provided in the embodiment of the present application. The specific flow is as follows:

In step 201, the server extracts at least one first visual feature from the image sample through the preset content recommendation model, extracts a query text word feature from the query text sample corresponding to the image sample, calculates a similarity between the image sample and the query text sample based on the first visual feature and the query text word feature, and determines first loss information corresponding to the preset content recommendation model based on the similarity.

The server may extract at least one first visual feature from the image sample through the preset content recommendation model, and may extract the query text word feature from the query text sample corresponding to the image sample in various manners, for example, the server may extract at least one first visual feature from the image sample by using a visual encoder in the preset content recommendation model, and extract the query text word feature from the query text sample corresponding to the image sample by using a text encoder in the preset content recommendation model.

For example, referring to fig. 3, in the image text contrast learning module, the server may extract at least one first visual feature from the image sample by using a visual encoder in the preset content recommendation model, and extract a query text word feature from the query text sample corresponding to the image sample by using a text encoder in the preset content recommendation model. Specifically, the image sample $v_i$ may be divided into a plurality of image blocks, and an image classification feature may be introduced. The image classification feature may be a classification feature corresponding to the image sample; it may be an independent, learnable embedding vector that is randomly generated rather than derived from the image content, which avoids a bias toward any specific image block in the image sample and allows the information of the image blocks in the image sample to be better gathered into the global feature of the image sample.

The image blocks and the image classification feature of the image sample are input into the visual encoder, and feature extraction is performed on them through N visual feature extraction layers in the visual encoder, so that the image classification feature output by the visual encoder can be used as the global visual feature, namely the first visual feature, of the image sample. Each visual feature extraction layer may include a self-attention network unit (Self Attention) and a multi-layer perceptron unit (MLP). The self-attention network unit may identify the association relationship between the image blocks of the image sample and perform feature extraction on the image blocks, and the extracted features are classified by the multi-layer perceptron unit to obtain features containing the global information of the image sample, so that the first visual feature of the image sample can be obtained through the feature extraction processing of the N visual feature extraction layers.

For the query text sample $t_i$, word segmentation processing may be performed on the query text sample $t_i$, word embedding processing may be performed on each word in the query text sample to obtain a word embedding sequence of the query text sample, and a text classification feature may be introduced. The text classification feature may be a classification feature corresponding to the query text sample; similarly, it can better fuse the information of each word in the query text sample for global aggregation. The word embedding sequence and the text classification feature corresponding to the query text sample may then be input into the text encoder, and feature extraction is performed on them through the $M_1$ and $M_2$ text feature extraction layers in the text encoder, so that the text classification feature output by the text encoder can be used as the query text word feature of the query text sample. The $M_1$ text feature extraction layers may be the levels in the text encoder that include a cross-attention network unit (Cross Attention); when feature extraction is performed on the query text sample in the image text contrast learning module, the cross-attention network unit may not be activated. Each text feature extraction layer may include a self-attention network unit and a multi-layer perceptron unit. The self-attention network unit may identify the contextual semantic association between the words in the query text sample and perform feature extraction on the word embedding sequence, and the extracted features are classified by the multi-layer perceptron unit to obtain features containing the global information of the query text sample, so that the query text word feature of the query text sample can be obtained through the feature extraction processing of the text feature extraction layers of the multiple levels.
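The construction of the visual encoder input, namely image blocks plus an independent classification feature, can be sketched as follows. The sketch covers only the input preparation and not the N feature extraction layers; the names `patchify` and `build_encoder_input`, the linear projection and all sizes are assumptions chosen for illustration.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping image blocks."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    blocks = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)          # (rows, cols, ph, pw, C)
    return blocks.reshape(-1, patch_size * patch_size * c)


def build_encoder_input(image, patch_size, projection, cls_embedding):
    """Prepend the classification feature to the projected image blocks."""
    tokens = patchify(image, patch_size) @ projection  # (num_blocks, D)
    return np.concatenate([cls_embedding[None, :], tokens], axis=0)


rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))                  # toy image sample
projection = rng.normal(size=(2 * 2 * 3, 8))   # hypothetical block embedding weights
cls_embedding = rng.normal(size=8)             # random init, independent of image content
encoder_input = build_encoder_input(image, 2, projection, cls_embedding)
```

After the encoder layers, the output at the first position (the classification feature) would serve as the global visual feature; the text side can be prepared analogously from the word embedding sequence and a text classification feature.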

The server may calculate the similarity between the image sample and the query text sample based on the first visual feature and the query text word feature, for example, the server may use a momentum contrast learning method as a training target of the image text contrast learning module, so that the server may obtain the momentum visual feature corresponding to the first visual feature and the momentum text word feature corresponding to the query text word feature, construct a negative sample feature pair and a positive sample feature pair corresponding to the preset content recommendation model based on the first visual feature, the query text word feature, the momentum visual feature and the momentum text word feature, calculate the first similarity between the negative sample feature pair and the second similarity between the positive sample feature pair, and determine the first similarity and the second similarity as the similarity between the image sample and the query text sample.

wherein $L_{I2T}$ denotes the first momentum contrast loss information, $B$ denotes the batch size, $N_q$ denotes the size of the queue corresponding to the momentum encoder, $\Sigma$ denotes summation, $\log$ denotes the logarithmic function, $\tau$ denotes the temperature hyper-parameter, $i$ denotes the i-th sample in the batch, $j$ denotes the j-th sample in the queue and batch, $T$ denotes the transpose, and $x^T y$ denotes the similarity between $x$ and $y$. The remaining symbols of the formula denote features mapped into the shared semantic space: the first visual feature corresponding to the image sample, the query text word feature corresponding to the query text sample, the momentum text word features in the batch, and the momentum text word features in the queue and batch. The similarity term between the image sample and its corresponding query text sample carries the information of the second similarity, and the similarity terms between the image sample and the text samples in the batch and queue carry the information of the first similarity.

In this way, the server may perform fusion processing on the first momentum contrast loss information and the second momentum contrast loss information to obtain the first loss information; for example, the first loss information may be $L_{align} = (L_{I2T} + L_{T2I})/2$.
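A minimal sketch of how such a momentum contrast loss might be computed is given below. It assumes the standard InfoNCE form (a softmax over the positive pair and the negatives from the batch and queue, scaled by the temperature), which is consistent with the symbols defined above but is not confirmed as the exact formula of this application; the function names and the toy features are hypothetical.

```python
import numpy as np


def info_nce(anchor, positives, queue, tau=0.07):
    """Contrast each anchor with its momentum positive and all momentum negatives.

    anchor    : (B, D) features of one modality from the online encoder.
    positives : (B, D) momentum features of the other modality, in the same order.
    queue     : (N_q, D) momentum features of the other modality from earlier batches.
    """
    candidates = np.concatenate([positives, queue], axis=0)    # (B + N_q, D)
    logits = anchor @ candidates.T / tau                       # x^T y / tau
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(anchor.shape[0])
    return float(-log_prob[idx, idx].mean())                   # positives on the diagonal


def alignment_loss(img, txt, img_m, txt_m, img_queue, txt_queue, tau=0.07):
    """Average of the image-to-text and text-to-image contrast losses."""
    l_i2t = info_nce(img, txt_m, txt_queue, tau)
    l_t2i = info_nce(txt, img_m, img_queue, tau)
    return (l_i2t + l_t2i) / 2.0


def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
img, txt = unit(rng.normal(size=(4, 8))), unit(rng.normal(size=(4, 8)))
img_queue, txt_queue = unit(rng.normal(size=(16, 8))), unit(rng.normal(size=(16, 8)))
# For this toy run the momentum features are simply copies of the online features.
loss = alignment_loss(img, txt, img, txt, img_queue, txt_queue)
```

In an actual training loop, the momentum features would come from a slowly updated copy of each encoder, and the queues would be refreshed with the momentum features of each batch.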

In step 202, the server obtains a defect text sample corresponding to the query text sample, extracts a defect text word feature from the defect text sample, obtains a global visual feature and a local visual feature corresponding to the image sample, and performs visual feature fusion processing on the global visual feature and the defect text word feature to obtain a defect text word feature after visual fusion.

Optionally, the server may perform random masking processing on the words in the query text sample by using a pre-training language model to obtain a masked text sample, perform feature extraction on the masked text sample to obtain masked text features, perform word prediction processing on mask positions in the masked text sample according to the masked text features, and generate a plurality of defect text samples corresponding to the query text sample based on the predicted words. The pre-trained language model may be BERT.

In one embodiment, with continued reference to fig. 3, a knowledge-based text editing module (knowledge-based text editing) may be used to perform word transformation processing on the query text sample to obtain a defect text sample corresponding to the query text sample. For example, for the query text sample $t_i$ "a woman plays with a dog", the word "woman" in the query text sample $t_i$ may be masked to obtain the masked text $t_i^{mask}$ "a [MASK] plays with a dog". The pre-trained language model may then be used to perform feature extraction on the masked text $t_i^{mask}$ to obtain masked text features, and mask recovery processing may be performed on the masked text features to generate plausible predicted words that do not match the original image. For example, the word "woman" may be replaced, according to the semantic scene, with words such as "boy", "man" or "cat"; the replaced word is an incorrect word (Incorrect word token), and the words in the defect text sample other than the replaced word are correct words (Correct word token), so that the defect text sample corresponding to the query text sample can be obtained according to the predicted words. In this way, the defect text sample provides a query text sample carrying a local semantic error, so that when the preset content recommendation model is trained according to the defect text sample, it can learn the local semantic information of the query text sample and identify the semantic information of the text at a fine granularity. Content matching the semantics of the query text can therefore be retrieved more accurately, and the content recommendation accuracy is improved.
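The masking and replacement procedure can be sketched as follows. The callable `predict_candidates` is a hypothetical stand-in for the mask recovery of a pre-trained language model such as BERT; here it is replaced by a fixed toy candidate list so that the sketch stays self-contained, and all names are assumptions.

```python
import random


def make_defect_samples(query_text, predict_candidates, num_samples=3, position=None, seed=0):
    """Mask one word and rebuild the text with predicted words that differ from it.

    predict_candidates(masked_tokens, position) is a hypothetical stand-in for a
    masked language model; it returns candidate words for the [MASK] position.
    """
    tokens = query_text.split()
    if position is None:
        position = random.Random(seed).randrange(len(tokens))   # random masking
    masked = tokens[:position] + ["[MASK]"] + tokens[position + 1:]
    samples = []
    for word in predict_candidates(masked, position):
        if word != tokens[position]:             # keep only words that introduce an error
            defect = tokens[:position] + [word] + tokens[position + 1:]
            samples.append(" ".join(defect))
        if len(samples) == num_samples:
            break
    return " ".join(masked), samples


def toy_predictor(masked_tokens, position):
    """Fixed candidates in place of real language model predictions."""
    return ["woman", "boy", "man", "cat"]


masked_text, defect_texts = make_defect_samples(
    "a woman plays with a dog", toy_predictor, position=1
)
```

With the toy candidates, the original word "woman" is skipped and the three defect text samples contain "boy", "man" and "cat" at the masked position.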

Optionally, there may be various manners for the server to perform visual feature fusion processing on the global visual feature and the defect text word features to obtain the visually fused defect text word features. For example, with continued reference to fig. 3, in the vision-language error modeling module based on the global visual feature, the server may add the global visual feature to each word feature in the defect text word features to obtain the visually fused defect text word features. For example, assume that the global visual feature is $h^v$ and the defect text word features are $(x_1, x_2, x_3)$, where $x_1$, $x_2$ and $x_3$ are the word features of the individual words in the defect text word features; adding the global visual feature to each word feature in the defect text word features then gives the visually fused defect text word features $(x_1 + h^v, x_2 + h^v, x_3 + h^v)$.
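The addition of the global visual feature to each word feature can be illustrated with array broadcasting. The function name `fuse_global_visual`, the variable `h_v` and the toy feature values are assumptions for illustration only.

```python
import numpy as np


def fuse_global_visual(word_features, global_visual):
    """Add the global visual feature to every word feature (broadcast over words)."""
    word_features = np.asarray(word_features, dtype=float)   # (num_words, D)
    global_visual = np.asarray(global_visual, dtype=float)   # (D,)
    if word_features.shape[-1] != global_visual.shape[-1]:
        raise ValueError("feature dimensions must match")
    return word_features + global_visual


x = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])   # three word features
h_v = np.array([0.5, -0.5])                          # global visual feature
fused = fuse_global_visual(x, h_v)
```

Each row of `fused` is the corresponding word feature shifted by the same global visual feature, so every word position carries the image-level information into the subsequent attention layers.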

In step 203, the server extracts a first target defect text word feature including a semantic association between the image sample and the defect text sample from the visually fused defect text word features, and predicts a first defect word in the defect text sample that does not match the image sample and a first correction word corresponding to the first defect word based on the first target defect text word feature.

The method for extracting the first target defect text word feature containing the semantic association relationship between the image sample and the defect text sample by the server in the visual fused defect text word feature can be multiple, for example, the server can perform association feature extraction on the visual fused defect text word feature to obtain a first association feature corresponding to each word feature in the visual fused defect text word feature, determine a first association weight corresponding to each word feature in the visual fused defect text word feature based on the first association feature, and perform fusion processing on the first association feature according to the first association weight to obtain the first target defect text word feature containing the semantic association relationship between the image sample and the defect text sample.

There may be various manners for the server to perform association feature extraction on the visually fused defect text word features to obtain the first association feature corresponding to each word feature. For example, with continued reference to fig. 3, the server may use a self-attention network unit (Self Attention) to perform feature extraction on each word feature in the visually fused defect text word features to obtain the first association feature corresponding to each word feature. For example, each word feature may be converted into three space vectors, namely a query vector (query, q), a key vector (key, k) and a value vector (value, v). Specifically, each word feature in the visually fused defect text word features may be combined with conversion parameters of the three dimensions, and the resulting query vector, key vector and value vector are taken as the first association feature corresponding to that word feature.

After the server performs associated feature extraction on the visual fused defect text word features, a first associated weight corresponding to each word feature in the visual fused defect text word features can be determined based on the first associated features. The server may determine, based on the first association feature, a plurality of ways of determining the first association weight corresponding to each word feature in the visually fused defective text word features, for example, the server may use a self-attention network unit to dot product a query vector corresponding to each word feature in the visually fused defective text word features with key vectors of other word features, may obtain an attention score (score) corresponding to each word feature, and calculate, based on the attention score corresponding to each word feature, the first association weight corresponding to each word feature in the visually fused defective text word features.

The server can adopt a self-attention network to extract the characteristics of the visual fused defective text word characteristics, and can also adopt other networks which can capture the association relation between each word characteristic and other word characteristics in the visual fused defective text word characteristics, so as to further determine the weight of each word characteristic in all word characteristics.

After determining, based on the first association features, the first association weight corresponding to each word feature in the visually fused defect text word features, the server may perform fusion processing on the first association features according to the first association weights to obtain the first target defect text word feature containing the semantic association relationship between the image sample and the defect text sample. There may be various manners of performing this fusion processing; for example, the server may weight the first association features according to the first association weights and sum the weighted results to obtain the first target defect text word feature containing the semantic association relationship between the image sample and the defect text sample.
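The query/key/value conversion, the attention scores, the association weights and the weighted sum described above can be sketched as a single-head self-attention step. The scaling by the square root of the key dimension and the softmax normalization are assumptions taken from the standard self-attention formulation, since the text only states that the weights are computed from the attention scores; the conversion parameters are random placeholders.

```python
import numpy as np


def self_attention(features, w_q, w_k, w_v):
    """Single-head self-attention over visually fused word features.

    features      : (num_words, D) visually fused defect text word features.
    w_q, w_k, w_v : (D, D_h) conversion parameters for query, key and value.
    """
    q, k, v = features @ w_q, features @ w_k, features @ w_v   # first association features
    scores = q @ k.T / np.sqrt(k.shape[-1])                    # attention scores
    scores = scores - scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)    # first association weights
    return weights @ v, weights                                # weighted sum of values


rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
target_features, weights = self_attention(features, w_q, w_k, w_v)
```

Each row of `weights` sums to one and expresses how strongly one word attends to the words of the defect text sample, so each output row mixes information from the whole visually fused sequence.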

After the server extracts, from the visually fused defect text word features, the first target defect text word feature containing the semantic association relationship between the image sample and the defect text sample, the defect word in the defect text sample that does not match the image sample and the correction word corresponding to the defect word can be predicted based on the first target defect text word feature. There may be various manners of doing so. For example, with continued reference to fig. 3, in the vision-language error modeling module based on the global visual feature, the server may use a binary classification layer in the preset content recommendation model to predict, based on the first target defect text word feature output by the text encoder, the first defect word in the defect text sample that does not match the image sample, together with the probability that the first defect word is an erroneous word in the defect text sample. The server may then use a multi-classification layer in the preset content recommendation model to predict a plurality of first correction words corresponding to the first defect word, together with the probability that each first correction word is the correct word in the query text sample corresponding to the first defect word.
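A detection head followed by a correction head might be sketched as below. Both heads are assumed here to be single linear layers (sigmoid for the per-word defect probability, softmax over a vocabulary for the correction word); the weights, the toy vocabulary and the 0.5 threshold are hypothetical and chosen only so that the example is easy to follow.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def predict_defects_and_corrections(features, w_det, w_cor, vocab, threshold=0.5):
    """Predict defect positions and the most probable correction word for each.

    features : (num_words, D) target defect text word features.
    w_det    : (D,) weights of the assumed binary (defect / not defect) head.
    w_cor    : (D, V) weights of the assumed multi-class correction head.
    """
    p_defect = 1.0 / (1.0 + np.exp(-(features @ w_det)))   # correctness prediction probability
    p_correction = softmax(features @ w_cor)               # correction prediction probabilities
    results = [
        (int(j), float(p_defect[j]), vocab[int(np.argmax(p_correction[j]))])
        for j in np.flatnonzero(p_defect > threshold)
    ]
    return p_defect, p_correction, results


vocab = ["woman", "girl", "man"]
features = np.eye(3)                        # toy features for a three-word text
w_det = np.array([-5.0, 5.0, -5.0])         # only the second word is flagged
w_cor = np.array([[0.0, 0.0, 0.0],
                  [2.0, 1.0, 0.0],          # second word: "woman" scores highest
                  [0.0, 0.0, 0.0]])
p_defect, p_correction, results = predict_defects_and_corrections(features, w_det, w_cor, vocab)
```

In this toy setting only the second word exceeds the threshold, and the correction head proposes "woman" for it; the full probability rows of `p_correction` correspond to the correction prediction probabilities used in the loss.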

In step 204, the server identifies a first difference word in the query text sample compared with the defect text sample based on the query text sample and the first defect word, and calculates first correctness prediction loss information corresponding to the preset content recommendation model according to the first difference word and the correctness prediction probability corresponding to the first defect word.

The first difference word in the query text sample compared with the defect text sample may be identified based on the query text sample and the first defect word in various manners. For example, assume that the query text sample is "a woman plays with a dog" and the defect text sample is "a boy plays with a dog". The first defect word is then "boy", and the first difference word in the query text sample compared with the defect text sample can be identified as "woman" according to the query text sample and the first defect word. Accordingly, the first correction words predicted by the preset content recommendation model for the first defect word may include, for example, "woman", "girl" and "man", and the model outputs a correction prediction probability for each first correction word; for example, the correction prediction probability of "woman" may be 0.7, that of "girl" may be 0.25, and that of "man" may be 0.05. The larger the correction prediction probability corresponding to the first correction word "woman", which matches the first difference word, the higher the accuracy of the preset content recommendation model.
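The identification of the difference word in this example can be sketched as a position-wise comparison of the two texts. The function name `find_difference_words` is hypothetical, and the sketch assumes that the defect text sample was built by substituting words in the query text sample, so that both texts contain the same number of whitespace-separated words.

```python
def find_difference_words(query_text, defect_text):
    """Return (position, query_word, defect_word) for each mismatching position.

    Assumes both texts have the same number of whitespace-separated words,
    i.e. the defect text was obtained by word substitution.
    """
    query_words = query_text.split()
    defect_words = defect_text.split()
    if len(query_words) != len(defect_words):
        raise ValueError("texts must have the same number of words")
    return [
        (i, q, d)
        for i, (q, d) in enumerate(zip(query_words, defect_words))
        if q != d
    ]


diffs = find_difference_words(
    "a woman plays with a dog",
    "a boy plays with a dog",
)
```

Here `diffs` contains a single entry for position 1, pairing the difference word "woman" with the defect word "boy".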

The method for calculating the first correctness prediction loss information corresponding to the preset content recommendation model according to the correctness prediction probabilities corresponding to the first difference word and the first defect word may be various, for example, a calculation formula of the first correctness prediction loss information corresponding to the preset content recommendation model may be expressed as:

wherein $L_{det}(h^{v})$ represents the first correctness prediction loss information and $\mathbb{E}(\cdot)$ denotes the expectation operator; the remaining symbols in the formula denote, respectively, the probability distribution of the first defect word detected at the j-th position in the defect text sample, the defect text sample, and the probability of detecting the first defect word at the j-th position in the defect text sample, namely the correctness prediction probability.
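As one possible reading of the symbol definitions above, the correctness prediction loss can be sketched as a binary cross-entropy over the word positions of the defect text sample. The use of binary cross-entropy, the averaging over positions, the function name `detection_loss` and the numerical values are assumptions made for illustration; they are not taken from the formula of the embodiment.

```python
import math


def detection_loss(det_probs, defect_labels, eps=1e-12):
    """Mean binary cross-entropy over word positions (assumed loss form).

    det_probs[j]     -- predicted probability that the j-th word is a defect word
    defect_labels[j] -- 1 if the j-th word differs from the query text, else 0
    """
    if len(det_probs) != len(defect_labels) or not det_probs:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(det_probs, defect_labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(det_probs)


# "a boy plays with a dog": only position 1 ("boy") is defective.
labels = [0, 1, 0, 0, 0, 0]
good = detection_loss([0.1, 0.9, 0.1, 0.1, 0.1, 0.1], labels)
bad = detection_loss([0.9, 0.1, 0.9, 0.9, 0.9, 0.9], labels)
```

Under this assumed form, a model that assigns a high detection probability to "boy" and low probabilities elsewhere (`good`) receives a smaller loss than a model that does the opposite (`bad`).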

In step 205, the server determines first correction prediction loss information corresponding to the preset content recommendation model based on the first difference word and the correction prediction probability corresponding to the first correction word, and determines first sub-loss information corresponding to the preset content recommendation model according to the first correctness prediction loss information and the first correction prediction loss information.

Accordingly, the calculation formula of the first correction prediction loss information corresponding to the preset content recommendation model may be expressed as:

wherein $L_{cor}(h^{v})$ represents the first correction prediction loss information when the second visual feature is the global visual feature; the remaining symbols in the formula denote, respectively, the probability distribution of the first correction word predicted at the j-th position in the defect text sample, and the probability of the first correction word predicted at the j-th position in the defect text sample, namely the correction prediction probability.
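A correction prediction loss consistent with these definitions can be sketched as the negative log-likelihood of the difference word at each defective position. The negative log-likelihood form, the averaging over defective positions and the function name `correction_loss` are assumptions for illustration; the probabilities are those of the "woman" / "girl" / "man" example.

```python
import math


def correction_loss(cor_probs_per_position, difference_words, eps=1e-12):
    """Mean negative log-likelihood of the difference word (assumed loss form).

    cor_probs_per_position -- {position: {candidate_word: probability}}
    difference_words       -- {position: correct word from the query text}
    """
    if not difference_words:
        raise ValueError("at least one defective position is required")
    total = 0.0
    for pos, word in difference_words.items():
        p = cor_probs_per_position[pos].get(word, 0.0)
        total -= math.log(max(p, eps))  # clamp to avoid log(0)
    return total / len(difference_words)


cor_probs = {1: {"woman": 0.7, "girl": 0.25, "man": 0.05}}
loss_if_woman = correction_loss(cor_probs, {1: "woman"})
loss_if_man = correction_loss(cor_probs, {1: "man"})
```

With the difference word "woman" the loss is the negative logarithm of 0.7, which is smaller than the loss that would result if the correct word were "man"; this matches the statement that a larger correction prediction probability for the difference word indicates a more accurate model.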

The server may determine the first sub-loss information corresponding to the preset content recommendation model according to the first correctness prediction loss information and the first correction prediction loss information in various manners, for example, the server may perform accumulation processing on the first correctness prediction loss information and the first correction prediction loss information to obtain the first sub-loss information corresponding to the preset content recommendation model.

In step 206, the server extracts target defect text word features including the association relation between each word in the query text sample from the defect text word features, and performs feature fusion processing on the local visual features and the target defect text word features to obtain second target defect text word features.

The server may extract the target defect text word features containing the association relationship between the words in the query text sample from the defect text word features in various manners. For example, referring again to fig. 3, a self-attention network unit (Self Attention) may be used to extract, from the defect text word features, the target defect text word features containing the association relationship between the words in the query text sample.
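The role of the self-attention network unit can be illustrated with a minimal scaled dot-product self-attention over word features. This sketch makes simplifying assumptions that are not stated in the embodiment: the query, key and value projections are taken to be the identity, a single attention head is used, and the scores are scaled by the square root of the feature dimension. The function name `self_attention` is hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(word_features):
    """Single-head self-attention with identity projections (simplified)."""
    d = len(word_features[0])
    outputs = []
    for q in word_features:
        # Attention score of this word against every word in the text.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k in word_features
        ]
        weights = softmax(scores)
        # Each output is a weighted sum of all word features.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, word_features)) for i in range(d)]
        )
    return outputs


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(features)
same = self_attention([[1.0, 2.0], [1.0, 2.0]])
```

Each output vector mixes information from every word in the text, which is how the resulting features come to contain the association relationship between the words.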

After extracting target defect text word characteristics containing the association relation between each word in the query text sample from the defect text word characteristics, the server can perform characteristic fusion processing on the local visual characteristics and the target defect text word characteristics. The method for obtaining the second target defect text word feature by the server through feature fusion processing of the local visual feature and the target defect text word feature may be multiple, for example, the server may perform association feature extraction on the local visual feature and the target defect text word feature to obtain a visual association feature corresponding to the local visual feature and a text association feature corresponding to the target defect text word feature, determine a second association weight corresponding to the visual association feature based on the visual association feature and the text association feature, and fuse the visual association feature based on the second association weight to obtain the second target defect text word feature.

The server may extract the association features of the local visual features and the target defect text word features, obtaining the visual association features corresponding to the local visual features and the text association features corresponding to the target defect text word features, in various manners. For example, referring again to fig. 3, in the visual-semantic error modeling module based on the local visual features, the server may use a cross-attention network unit for the association feature extraction. Specifically, the server may convert each word feature in the target defect text word features into a query vector, and convert each local visual feature of the image sample into a key vector and a value vector; the conversion can be understood as fusing the target defect text word features and the local visual features with the conversion parameters of the corresponding dimensions. The resulting query vectors serve as the text association features corresponding to the word features in the target defect text word features, and the key vectors and value vectors serve as the visual association features corresponding to the local visual features.

After extracting the association features of the local visual feature and the target defect text word feature, the server can determine a second association weight corresponding to the visual association feature based on the visual association feature and the text association feature. The server may determine the second association weight corresponding to the visual association feature based on the visual association feature and the text association feature in various manners, for example, the server may perform dot product on a query vector corresponding to the target defect text word feature and a key vector of the local visual feature by using a cross attention network, and may obtain an attention score of each local visual feature and the corresponding target defect text word feature, and calculate the second association weight of the visual association feature based on the attention score.

After determining the second association weight corresponding to the visual association feature based on the visual association feature and the text association feature, the server can perform fusion processing on the visual association feature based on the second association weight to obtain a second target defect text word feature. The method of the server for performing fusion processing on the visual association features based on the second association weights can be various, for example, the server can perform weighting processing on the visual association features according to the second association weights and sum the weighted processing results to obtain second target defect text word features.
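The cross-attention fusion described in the preceding paragraphs (query vectors from the text, key and value vectors from the local visual features, dot-product attention scores, and a weighted sum of the value vectors) can be sketched as follows. The function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the softmax normalization of the scores and the scaling by the square root of the key dimension are assumptions for illustration.

```python
import math


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_feats, visual_feats, w_q, w_k, w_v):
    """Text words attend to local visual features (single head, simplified)."""
    queries = [matvec(w_q, t) for t in text_feats]    # text association features
    keys = [matvec(w_k, v) for v in visual_feats]     # visual association features
    values = [matvec(w_v, v) for v in visual_feats]
    scale = math.sqrt(len(keys[0]))
    fused, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)                     # second association weights
        fused.append(
            [sum(w * val[i] for w, val in zip(weights, values))
             for i in range(len(values[0]))]
        )
        all_weights.append(weights)
    return fused, all_weights


identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection parameters
text_feats = [[4.0, 0.0], [0.0, 4.0]]
visual_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused, weights = cross_attention(text_feats, visual_feats, identity, identity, identity)
```

In this toy input the first word attends most strongly to the first image region and the second word to the second region, so each fused word feature is dominated by the visual feature it is most associated with.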

In step 207, the server predicts a second defect word in the defect text sample that does not match the image sample and a second correction word corresponding to the second defect word according to the second target defect text word features, and identifies a second difference word in the query text sample compared with the defect text sample based on the query text sample and the second defect word.

The server may predict, according to the second target defect text word features, the defect word in the defect text sample that does not match the image sample and the correction word corresponding to the defect word in various manners. For example, referring to fig. 3, in the vision-language error modeling module based on the local visual features, the server may use a binary classification layer in the preset content recommendation model to predict, based on the second target defect text word features output by the text encoder, a second defect word in the defect text sample that does not match the image sample, together with the probability that the second defect word is an erroneous word in the defect text sample. The server may then use a multi-classification layer in the preset content recommendation model to predict a plurality of second correction words corresponding to the second defect word, together with the probability of each second correction word, where a second correction word is a candidate for the correct word in the query text sample that corresponds to the second defect word.

In step 208, the server calculates second correctness prediction loss information corresponding to the preset content recommendation model according to the second difference word and the correctness prediction probability corresponding to the second defect word, determines second correction prediction loss information corresponding to the preset content recommendation model based on the second difference word and the correction prediction probability corresponding to the second correction word, and determines second sub-loss information corresponding to the preset content recommendation model according to the second correctness prediction loss information and the second correction prediction loss information.

The server may calculate the second correctness prediction loss information corresponding to the preset content recommendation model according to the second difference word and the correctness prediction probability corresponding to the second defect word in various manners, for example, a calculation formula of the second correctness prediction loss information corresponding to the preset content recommendation model may be expressed as:

wherein the symbols in the formula denote, respectively: the second correctness prediction loss information; the local visual features; the expectation operator $\mathbb{E}(\cdot)$; the probability distribution of the second defect word detected at the j-th position in the defect text sample; the defect text sample; and the probability of detecting the second defect word at the j-th position in the defect text sample, namely the correctness prediction probability.

Accordingly, the calculation formula of the second correction prediction loss information corresponding to the preset content recommendation model may be expressed as:

wherein the symbols in the formula denote, respectively: the second correction prediction loss information; the probability distribution of the second correction word predicted at the j-th position in the defect text sample; and the probability of the second correction word predicted at the j-th position in the defect text sample, namely the correction prediction probability.

The method for determining the second sub-loss information corresponding to the preset content recommendation model according to the second correctness prediction loss information and the second correction prediction loss information may be various, for example, the second correctness prediction loss information and the second correction prediction loss information may be accumulated to obtain the second sub-loss information corresponding to the preset content recommendation model.

In step 209, the server performs fusion processing on the first sub-loss information and the second sub-loss information to obtain second loss information corresponding to the preset content recommendation model, performs convergence processing on the preset content recommendation model based on the first loss information and the second loss information to obtain a trained content recommendation model, and performs content recommendation processing on the query text through the trained content recommendation model.

Optionally, the method of the server fusing the first sub-loss information and the second sub-loss information to obtain the second loss information corresponding to the preset content recommendation model may be multiple, for example, the server may sum the first sub-loss information and the second sub-loss information to obtain the second loss information corresponding to the preset content recommendation model.

The server may perform convergence processing on the preset content recommendation model based on the first loss information and the second loss information to obtain the trained content recommendation model in various manners. For example, the server may accumulate the first loss information and the second loss information to obtain target loss information corresponding to the preset content recommendation model, and then perform convergence processing on the preset content recommendation model according to the target loss information to obtain the trained content recommendation model.
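The way the individual losses are combined in steps 205, 208 and 209 can be summarized in a short sketch. The function name `target_loss` and the numerical values are hypothetical; the unweighted sums follow the accumulation described in the text, and any weighting coefficients between the terms would be an additional assumption that is not made here.

```python
def target_loss(first_loss, det_loss_1, cor_loss_1, det_loss_2, cor_loss_2):
    """Combine the losses by plain accumulation, as described in the text."""
    # First sub-loss: global-visual-feature branch (correctness + correction).
    first_sub_loss = det_loss_1 + cor_loss_1
    # Second sub-loss: local-visual-feature branch (correctness + correction).
    second_sub_loss = det_loss_2 + cor_loss_2
    # Second loss information: fusion (sum) of the two sub-losses.
    second_loss = first_sub_loss + second_sub_loss
    # Target loss used for the convergence processing.
    return first_loss + second_loss


total = target_loss(
    first_loss=1.0,
    det_loss_1=0.5, cor_loss_1=0.25,
    det_loss_2=0.5, cor_loss_2=0.25,
)
```

The resulting target loss is the single scalar that the convergence processing would minimize, for example by gradient descent.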

The server may perform content recommendation processing on the query text through the trained content recommendation model in various manners. For example, referring to fig. 3, the trained content recommendation model may be a model structure that includes only the image contrast learning module. The server can perform feature extraction on the content to be recommended through the trained content recommendation model to obtain the visual features of the content to be recommended, input the query text into the content recommendation model, and perform feature extraction on the query text through the content recommendation model to obtain query text features. The server then maps the visual features of the content to be recommended and the query text features into a shared semantic space, calculates the similarity between the visual features and the query text features in the shared semantic space, screens out at least one target content matching the query text from the content to be recommended according to the similarity between the visual features and the query text features, and performs content recommendation processing on the target content.

The server may screen out at least one target content matching the query text from the content to be recommended according to the similarity between the visual features and the query text features in various manners. For example, the server may determine the content to be recommended with the highest similarity as the target content, or may sort the content to be recommended according to the similarity and determine the sorted content to be recommended as the target content.
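The screening step can be sketched as a similarity ranking in the shared semantic space. Cosine similarity is assumed here as the similarity measure, which the text does not specify; the function names `cosine_similarity` and `recommend`, the content identifiers and the two-dimensional features are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query_feature, candidates, top_k=1):
    """Rank candidate content by similarity to the query text feature.

    candidates -- {content_id: visual feature in the shared semantic space}
    """
    ranked = sorted(
        candidates,
        key=lambda cid: cosine_similarity(query_feature, candidates[cid]),
        reverse=True,
    )
    return ranked[:top_k]


candidates = {
    "img_a": [0.9, 0.1],
    "img_b": [0.1, 0.9],
    "img_c": [0.7, 0.7],
}
best = recommend([1.0, 0.0], candidates, top_k=1)
top_two = recommend([1.0, 0.0], candidates, top_k=2)
```

Setting `top_k=1` corresponds to taking the content with the highest similarity as the target content, while a larger `top_k` corresponds to returning the sorted content.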

As can be seen from the above, in the embodiment of the present application, the server extracts at least one first visual feature from an image sample through a preset content recommendation model, extracts query text word features from a query text sample corresponding to the image sample, calculates the similarity between the image sample and the query text sample based on the first visual feature and the query text word features, and determines first loss information corresponding to the preset content recommendation model based on the similarity; the server obtains a defect text sample corresponding to the query text sample, extracts defect text word features from the defect text sample, obtains global visual features and local visual features corresponding to the image sample, and performs visual feature fusion processing on the global visual features and the defect text word features to obtain visually fused defect text word features; the server extracts, from the visually fused defect text word features, first target defect text word features containing the semantic association relationship between the image sample and the defect text sample, and predicts, based on the first target defect text word features, a first defect word in the defect text sample that does not match the image sample and a first correction word corresponding to the first defect word; the server identifies a first difference word in the query text sample compared with the defect text sample based on the query text sample and the first defect word, and calculates first correctness prediction loss information corresponding to the preset content recommendation model according to the first difference word and the correctness prediction probability corresponding to the first defect word; the server determines first correction prediction loss information corresponding to the preset content recommendation model based on the first difference word and the correction prediction probability corresponding to the first correction word, and determines first sub-loss information corresponding to the preset content recommendation model according to the first correctness prediction loss information and the first correction prediction loss information; the server extracts target defect text word features containing the association relationship between the words in the query text sample from the defect text word features, and performs feature fusion processing on the local visual features and the target defect text word features to obtain second target defect text word features; the server predicts, according to the second target defect text word features, a second defect word in the defect text sample that does not match the image sample and a second correction word corresponding to the second defect word, and identifies a second difference word in the query text sample compared with the defect text sample based on the query text sample and the second defect word; the server calculates second correctness prediction loss information corresponding to the preset content recommendation model according to the second difference word and the correctness prediction probability corresponding to the second defect word, determines second correction prediction loss information corresponding to the preset content recommendation model based on the second difference word and the correction prediction probability corresponding to the second correction word, and determines second sub-loss information corresponding to the preset content recommendation model according to the second correctness prediction loss information and the second correction prediction loss information; the server performs fusion processing on the first sub-loss information and the second sub-loss information to obtain second loss information corresponding to the preset content recommendation model, performs convergence processing on the preset content recommendation model based on the first loss information and the second loss information to obtain a trained content recommendation model, and performs content recommendation processing on the query text through the trained content recommendation model.

According to this scheme, the similarity between the image sample and the query text sample is calculated from the first visual features of the image sample and the query text word features of the query text sample, so that first loss information reflecting the global semantic association between the image sample and the query text is determined according to the similarity. The defect words in the defect text sample that do not match the image sample, and the correction words corresponding to the defect words, are then predicted according to the global visual features and the local visual features of the image sample and the defect text word features of the defect text sample, and second loss information reflecting the local semantic association between the image sample and the query text is determined according to the defect words, the correction words and the query text sample. In this way, the global visual features and the local visual features of the image sample interact with the local semantic errors of the defect text sample, so that the preset content recommendation model learns the associations between the global visual semantics, the local visual semantics and the local text semantics while detecting and correcting the errors in the defect text sample. This promotes a finer-grained alignment between the image and the text, so that content recommendation performed according to the trained content recommendation model matches the query text more closely, and the accuracy of content recommendation is improved.

In order to better implement the above method, the embodiment of the present application further provides a content recommendation device, which may be integrated in a computer device, and the computer device may be a server.

For example, fig. 5 is a schematic structural diagram of a content recommendation device provided in an embodiment of the present application. The content recommendation device may include an extracting unit 301, a calculating unit 302, an obtaining unit 303, a predicting unit 304, a determining unit 305, a converging unit 306, and a recommending unit 307, as follows:

an extracting unit 301, configured to extract at least one first visual feature from an image sample through a preset content recommendation model, and extract a query text word feature from a query text sample corresponding to the image sample;

the calculating unit 302 is configured to calculate a similarity between the image sample and the query text sample based on the first visual feature and the query text word feature, and determine first loss information corresponding to the preset content recommendation model based on the similarity;

an obtaining unit 303, configured to obtain a defect text sample corresponding to the query text sample, extract a defect text word feature from the defect text sample, and obtain at least one second visual feature corresponding to the image sample;

a prediction unit 304, configured to predict, according to the second visual feature and the defect text word feature, a defect word in the defect text sample that does not match the image sample, and predict a correction word corresponding to the defect word;

a determining unit 305, configured to determine second loss information corresponding to the preset content recommendation model according to the defect word, the correction word and the query text sample corresponding to the defect text sample;

the convergence unit 306 is configured to perform convergence processing on the preset content recommendation model based on the first loss information and the second loss information, so as to obtain a trained content recommendation model;

and a recommending unit 307, configured to perform content recommendation processing on the query text through the trained content recommendation model.

In an embodiment, the second visual features include global visual features corresponding to the image samples, and the prediction unit 304 includes:

the first target defect text word feature extraction subunit is used for extracting first target defect text word features containing semantic association relations between the image samples and the defect text samples from the defect text word features after visual fusion;

and the first word prediction subunit is used for predicting the defect word which is not matched with the image sample in the defect text sample and the correction word corresponding to the defect word based on the first target defect text word characteristics.

In an embodiment, the first target defect text word feature extraction subunit comprises:

and the first fusion module is used for carrying out fusion processing on the first association features according to the first association weight to obtain first target defect text word features containing the semantic association relationship between the image sample and the defect text sample.

In an embodiment, the second visual features include local visual features corresponding to the image samples, and the prediction unit 304 includes:

the target defect text word feature extraction subunit is used for extracting target defect text word features containing association relations among each word in the query text sample from the defect text word features;

In one embodiment, a feature fusion subunit comprises:

the second association weight determining module is used for determining a second association weight corresponding to the visual association feature based on the visual association feature and the text association feature;

In one embodiment, the computing unit 302 includes:

the history feature acquisition subunit is used for acquiring momentum visual features corresponding to the first visual features and momentum text word features corresponding to the query text word features;

the sample pair constructing subunit is used for constructing a negative sample feature pair and a positive sample feature pair corresponding to the preset content recommendation model based on the first visual feature, the query text word feature, the momentum visual feature and the momentum text word feature;

and the similarity calculation subunit is used for calculating a first similarity between the negative sample feature pairs and a second similarity between the positive sample feature pairs, and determining the first similarity and the second similarity as the similarity between the image sample and the query text sample.

In an embodiment, the determining unit 305 includes:

the differential word recognition subunit is used for recognizing differential words in the query text sample compared with the defect text sample based on the query text sample and the defect words;

the second loss calculation subunit is used for determining correction prediction loss information corresponding to the preset content recommendation model based on the difference word and the correction prediction probability corresponding to the correction word;

In specific implementation, each of the above units may be implemented as an independent entity, or the units may be combined arbitrarily and implemented as the same entity or as several entities. For the specific implementation of each unit, reference may be made to the foregoing method embodiments, which are not repeated here.

As can be seen from the above, in the embodiment of the present application, the extracting unit 301 extracts at least one first visual feature from the image sample through the preset content recommendation model, and extracts the feature of the query text word from the query text sample corresponding to the image sample; the computing unit 302 calculates the similarity between the image sample and the query text sample based on the first visual feature and the query text word feature, and determines first loss information corresponding to the preset content recommendation model based on the similarity; the obtaining unit 303 obtains a defect text sample corresponding to the query text sample, extracts a defect text word feature from the defect text sample, and obtains at least one second visual feature corresponding to the image sample; the prediction unit 304 predicts a defect word which is not matched with the image sample in the defect text sample according to the second visual feature and the defect text word feature, and predicts a correction word corresponding to the defect word; the determining unit 305 determines second loss information corresponding to the preset content recommendation model according to the defect word, the correction word and the query text sample corresponding to the defect text sample; the convergence unit 306 converges the preset content recommendation model based on the first loss information and the second loss information to obtain a trained content recommendation model; the recommending unit 307 performs content recommendation processing on the query text by the trained content recommendation model. 
According to this scheme, the similarity between the image sample and the query text sample is calculated from the first visual features of the image sample and the query text word features of the query text sample, so that first loss information of the preset content recommendation model, reflecting the global semantic association between the image sample and the query text, is determined according to the similarity. The defect words in the defect text sample that do not match the image sample, and the correction words corresponding to the defect words, are then predicted according to the second visual features of the image sample and the defect text word features of the defect text sample, so that second loss information of the preset content recommendation model, reflecting the local semantic association between the image sample and the query text, is determined according to the defect words, the correction words and the query text sample. The preset content recommendation model is converged based on the first loss information and the second loss information, so that it can learn both the global semantic association and the local semantic association between the image sample and the query text. When content recommendation processing is performed according to the trained content recommendation model, the associations between the local text semantics of the query text and the multiple visual semantics of the image can be identified, which improves the degree to which the recommended content matches the query text and thus improves the accuracy of content recommendation.

The embodiment of the application also provides a computer device. Fig. 6 shows a schematic structural diagram of the computer device according to the embodiment of the application, where the computer device may be a server. Specifically:

the computer device may include a processor 401 with one or more processing cores, a memory 402 comprising one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the computer device structure shown in fig. 6 does not limit the computer device, which may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components. Wherein:

the processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various function applications and content recommendation by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.

The computer device further comprises a power supply 403 for supplying power to the various components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that charging, discharging, and power consumption management are handled by the power management system. The power supply 403 may also include one or more of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The computer device may also include an input unit 404, which input unit 404 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the computer device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:

extracting at least one first visual feature from an image sample through a preset content recommendation model, and extracting query text word features from a query text sample corresponding to the image sample;

calculating the similarity between the image sample and the query text sample based on the first visual feature and the query text word features, and determining first loss information corresponding to the preset content recommendation model based on the similarity;

obtaining a defect text sample corresponding to the query text sample, extracting defect text word features from the defect text sample, and obtaining at least one second visual feature corresponding to the image sample;

predicting, according to the second visual feature and the defect text word features, a defect word in the defect text sample that does not match the image sample, and predicting a correction word corresponding to the defect word;

determining second loss information corresponding to the preset content recommendation model according to the defect word, the correction word and the query text sample corresponding to the defect text sample;

performing convergence processing on the preset content recommendation model based on the first loss information and the second loss information to obtain a trained content recommendation model;

and performing content recommendation processing on a query text through the trained content recommendation model.
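The step of obtaining a defect text sample can be pictured with the following minimal sketch. It assumes, purely for illustration, that a defect text sample is built by replacing randomly chosen words of the query text sample with other words from a vocabulary; this section does not specify the construction, and the names `make_defect_sample`, `vocabulary`, `num_defects` and `seed` are hypothetical.

```python
import random

def make_defect_sample(query_words, vocabulary, num_defects=1, seed=0):
    """Replace num_defects words of the query text with other vocabulary words.

    Returns the defect text, a 0/1 label per word (1 = defect word), and a
    mapping from each defect position to the original (correction) word.
    The replacement strategy is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    positions = rng.sample(range(len(query_words)), num_defects)
    defect_words = list(query_words)
    corrections = {}
    for pos in positions:
        # Only consider replacements that actually change the word.
        candidates = [w for w in vocabulary if w != query_words[pos]]
        defect_words[pos] = rng.choice(candidates)
        corrections[pos] = query_words[pos]
    labels = [1 if i in corrections else 0 for i in range(len(query_words))]
    return defect_words, labels, corrections

query = ["a", "dog", "runs"]
vocab = ["a", "dog", "cat", "runs", "sits"]
defect, labels, corrections = make_defect_sample(query, vocab)
print(defect, labels, corrections)
```

The labels and the correction mapping produced this way are the kind of supervision that the defect word prediction and correction word prediction would be compared against.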

For the specific implementation of each operation, refer to the previous embodiments; details are not repeated here. It should be noted that the computer device provided in this embodiment belongs to the same concept as the content recommendation method of the above embodiments; its detailed implementation is described in the method embodiments and is not repeated here.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.

To this end, an embodiment of the present application provides a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the content recommendation methods provided by the embodiments of the present application. For example, the instructions may perform the steps of:

The computer-readable storage medium may comprise: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Because the instructions stored in the computer-readable storage medium can execute the steps of any content recommendation method provided by the embodiments of the present application, they can achieve the beneficial effects of any such method. These effects are detailed in the previous embodiments and are not repeated here.

According to an aspect of the application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.

The content recommendation method, apparatus, computer-readable storage medium and computer device provided by the embodiments of the present application have been described in detail above. This description sets out the principles and implementations of the present application and is intended only to help in understanding its method and core ideas. Meanwhile, those skilled in the art may, in light of the ideas of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present application.

Claims

1. A content recommendation method, comprising:

2. The content recommendation method according to claim 1, wherein the second visual feature includes a global visual feature corresponding to the image sample, and the predicting a defect word in the defect text sample that does not match the image sample and predicting a correction word corresponding to the defect word according to the second visual feature and the defect text word features includes:

performing visual feature fusion processing on the global visual features and the defect text word features to obtain defect text word features after visual fusion;

extracting, from the visually fused defect text word features, first target defect text word features containing semantic association relations between the image sample and the defect text sample;

predicting, based on the first target defect text word features, a defect word in the defect text sample that does not match the image sample and a correction word corresponding to the defect word.

3. The content recommendation method according to claim 2, wherein the extracting first target defect text word features containing semantic association relations between the image sample and the defect text sample from the visually fused defect text word features includes:

extracting association features from the visually fused defect text word features to obtain a first association feature corresponding to each word feature in the visually fused defect text word features;

determining a first association weight corresponding to each word feature in the visually fused defect text word features based on the first association features;

and carrying out fusion processing on the first association features according to the first association weights to obtain first target defect text word features containing semantic association relations between the image samples and the defect text samples.
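The steps of claims 2 and 3 resemble a self-attention computation, which the following minimal sketch illustrates. It assumes that visual fusion is an element-wise addition of the global visual feature to every word feature and that the association features are the fused features themselves (identity projections); a real model would normally use learned projections. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fuse_global_visual(word_features, global_visual):
    """Assumed fusion: add the global visual feature to every word feature."""
    return [[w + g for w, g in zip(word, global_visual)] for word in word_features]

def self_attention(features):
    """Single-head self-attention with identity projections (an assumption).

    For each word, association weights over all words are computed from
    scaled dot products, and the features are fused according to those weights.
    """
    dim = len(features[0])
    outputs = []
    for query in features:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in features])
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, features))
            for d in range(dim)
        ])
    return outputs

fused = fuse_global_visual([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
print(self_attention(fused))
```

Each output row is a weighted average of the fused word features, so every word's representation carries information from the other words and from the image.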

4. The content recommendation method according to claim 1, wherein the second visual feature includes a local visual feature corresponding to the image sample, and the predicting a defect word in the defect text sample that does not match the image sample and predicting a correction word corresponding to the defect word according to the second visual feature and the defect text word features includes:

extracting, from the defect text word features, target defect text word features containing the association relation between each word in the query text sample;

carrying out feature fusion processing on the local visual features and the target defect text word features to obtain second target defect text word features;

predicting, according to the second target defect text word features, the defect word in the defect text sample that does not match the image sample and the correction word corresponding to the defect word.

5. The content recommendation method according to claim 4, wherein the performing feature fusion processing on the local visual feature and the target defect text word feature to obtain a second target defect text word feature includes:

extracting association features from the local visual features and the target defect text word features to obtain visual association features corresponding to the local visual features and text association features corresponding to the target defect text word features;

determining a second association weight corresponding to the visual association feature based on the visual association feature and the text association feature;

and carrying out fusion processing on the visual association features based on the second association weight to obtain second target defect text word features.
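The fusion in claim 5 can be read as a cross-attention step in which each word attends over the local visual features. The sketch below is a minimal illustration under that reading; it assumes identity projections for the association features and scaled dot-product weights, neither of which is stated in the claim, and the name `cross_attention` is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_features, visual_features):
    """Fuse local visual features into each word (identity projections assumed).

    The association weight of each visual feature is computed from its dot
    product with the word's text feature; the visual features are then
    averaged according to those weights.
    """
    dim = len(visual_features[0])
    fused = []
    for word in text_features:
        scores = [
            sum(a * b for a, b in zip(word, region)) / math.sqrt(dim)
            for region in visual_features
        ]
        weights = softmax(scores)
        fused.append([
            sum(w * region[d] for w, region in zip(weights, visual_features))
            for d in range(dim)
        ])
    return fused

words = [[10.0, 0.0]]                # one word feature
regions = [[1.0, 0.0], [0.0, 1.0]]   # two local visual features
print(cross_attention(words, regions))
```

Here the single word is strongly aligned with the first region, so the fused feature is dominated by that region.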

6. The content recommendation method of claim 1 wherein said calculating a similarity between said image sample and a query text sample based on said first visual feature and said query text word feature comprises:

acquiring a momentum visual feature corresponding to the first visual feature and a momentum text word feature corresponding to the query text word feature;

based on the first visual feature, the query text word feature, the momentum visual feature and the momentum text word feature, constructing a negative sample feature pair and a positive sample feature pair corresponding to the preset content recommendation model;

and calculating a first similarity between the negative sample feature pairs and a second similarity between the positive sample feature pairs, and determining the first similarity and the second similarity as the similarity between the image sample and the query text sample.
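One common way to realise momentum features and positive and negative pairs is a momentum encoder combined with an InfoNCE-style contrastive loss. The sketch below illustrates that general technique; it is an assumption that the claimed similarity feeds such a loss, and the names `momentum_update`, `contrastive_loss`, the coefficient `m` and the `temperature` value are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def momentum_update(momentum_params, online_params, m=0.995):
    """Exponential moving average of parameters, as used by momentum encoders."""
    return [m * p_m + (1.0 - m) * p for p_m, p in zip(momentum_params, online_params)]

def contrastive_loss(visual, momentum_text, temperature=0.07):
    """Image-to-text InfoNCE loss over a batch of L2-normalised features.

    Pair (i, i) is treated as the positive pair; pairs (i, j != i) are
    treated as negative pairs.
    """
    total = 0.0
    for i, v in enumerate(visual):
        sims = [dot(v, t) / temperature for t in momentum_text]
        m = max(sims)
        # Log-sum-exp over positive and negative similarities.
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_denominator - sims[i]
    return total / len(visual)

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(images, aligned_text) < contrastive_loss(images, swapped_text))
```

The loss is small when each image is most similar to its own text and large when it is more similar to another sample's text, which is the behaviour a global image-text alignment objective needs.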

7. The content recommendation method according to any one of claims 1 to 6, wherein the determining second loss information corresponding to the preset content recommendation model according to the defect word, the correction word, and the query text sample corresponding to the defect text sample includes:

identifying a difference word in the query text sample compared with the defect text sample based on the query text sample and the defect word;

according to the difference word and the accuracy prediction probability corresponding to the defect word, calculating accuracy prediction loss information corresponding to the preset content recommendation model;

determining correction prediction loss information corresponding to the preset content recommendation model based on the difference word and the correction prediction probability corresponding to the correction word;

and determining second loss information corresponding to the preset content recommendation model according to the accuracy prediction loss information and the correction prediction loss information.
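The loss of claim 7 can be sketched as the sum of a detection term and a correction term. The following minimal example assumes binary cross-entropy for the accuracy prediction loss, cross-entropy over candidate words for the correction prediction loss, equal weighting of the two terms, and texts of equal length; none of these choices is fixed by the claim, and the function and parameter names are hypothetical.

```python
import math

def difference_positions(query_words, defect_words):
    """Positions where the defect text differs from the query text."""
    return [i for i, (q, d) in enumerate(zip(query_words, defect_words)) if q != d]

def second_loss(query_words, defect_words, defect_probs, correction_probs):
    """Sum of a detection loss and a correction loss (equal weighting assumed).

    defect_probs[i]: predicted probability that word i is a defect word.
    correction_probs[i]: dict mapping candidate words to predicted probability,
    given for each position that differs from the query text.
    """
    eps = 1e-12  # avoids log(0)
    diff = set(difference_positions(query_words, defect_words))

    # Detection: binary cross-entropy over every word position.
    detection = 0.0
    for i, p in enumerate(defect_probs):
        target = 1.0 if i in diff else 0.0
        detection -= target * math.log(p + eps) + (1.0 - target) * math.log(1.0 - p + eps)
    detection /= len(defect_probs)

    # Correction: negative log-probability of the original word at each defect.
    correction = 0.0
    for i in diff:
        correction -= math.log(correction_probs[i].get(query_words[i], 0.0) + eps)
    if diff:
        correction /= len(diff)
    return detection + correction

query = ["a", "dog", "runs"]
defect = ["a", "cat", "runs"]
good = second_loss(query, defect, [0.1, 0.9, 0.1], {1: {"dog": 0.9, "cat": 0.1}})
bad = second_loss(query, defect, [0.9, 0.1, 0.9], {1: {"dog": 0.1, "cat": 0.9}})
print(good < bad)
```

A model that flags "cat" as the defect word and proposes "dog" as its correction receives the lower loss, which is the behaviour the local semantic objective is meant to reward.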

8. A content recommendation device, comprising:

9. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the content recommendation method of any one of claims 1 to 7.

10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the content recommendation method according to any one of claims 1 to 7 when the computer program is executed.