CN113094534B - Multi-mode image-text recommendation method and device based on deep learning - Google Patents

Multi-mode image-text recommendation method and device based on deep learning

Info

Publication number
CN113094534B
CN113094534B (application CN202110385246.5A)
Authority
CN
China
Prior art keywords
text
image
representation
layer
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110385246.5A
Other languages
Chinese (zh)
Other versions
CN113094534A (en)
Inventor
黄昭
胡浩武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University
Priority to CN202110385246.5A
Publication of CN113094534A
Application granted
Publication of CN113094534B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 16/535 - Information retrieval of still image data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F 16/335 - Information retrieval of unstructured textual data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F 18/23213 - Pattern recognition; non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/241 - Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 - Pattern recognition; fusion techniques
    • G06N 3/08 - Computing arrangements based on biological models; neural networks; learning methods

Abstract

The invention discloses a multi-modal image-text recommendation method and device based on deep learning. The method uses a cross-modal image-text retrieval model MMDNN within a recommendation system. A positive and negative feedback cluster center calculation module PNFCCCM and the user's positive and negative feedback history are used to compute the user's positive and negative feedback cluster centers. Combining the similarity scores of candidate data with their positive and negative feedback scores, the items with the highest comprehensive score relative to the user's history are selected from the database, and the MMDNN model then retrieves the corresponding data of the other modality from the database. Paired image-text resources are recommended to the user, and the user's history and positive and negative feedback cluster centers are updated according to the user's feedback, thereby realizing multi-modal image-text recommendation.

Description

Multi-mode image-text recommendation method and device based on deep learning
Technical Field
The invention belongs to the field of computer science and technology application, and particularly relates to a deep learning-based multi-modal image-text recommendation method and device.
Background
Currently, most recommendation systems focus on providing single-modality content, such as recommending pictures based on pictures and texts based on texts. In fact, pictures and texts describe the same semantics in unbalanced yet complementary ways: images can usually convey details that text cannot show, while text is better at expressing high-level meaning. Users therefore need combined multi-modal information resources, which makes cross-modal retrieval technology increasingly attractive. Cross-modal retrieval is a technique that, given a query in one modality, returns combined information in multiple modalities. At present, many cross-modal retrieval methods are applied only to retrieval rather than to recommendation systems, and they suffer from drawbacks such as insufficient retrieval precision and long running time. Moreover, most recommendation systems consider only the user's positive feedback, even though the user's negative feedback records also contain much useful information. It is therefore necessary to improve the quality and efficiency of cross-modal retrieval methods.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a multi-mode image-text recommendation method and device based on deep learning, and the purpose of recommending the needed image-text combined information resources to the user according to the personal preference of the user is achieved by designing an efficient cross-modal image-text retrieval method and applying the method to a recommendation system.
In order to achieve the purpose, the invention adopts the technical scheme that the multi-modal image-text recommendation method based on deep learning comprises the following steps:
calculating the cluster center points of positive feedback and negative feedback of a user by adopting a cross-modal retrieval model based on the user's history records, wherein the history records comprise images and texts;
selecting the first N history records with the highest user scores from the user's history records;
extracting features of the N history records, and obtaining the categories of the N history records according to the features;
extracting data of the same category and the same modality as the history records from the database by using the cross-modal retrieval model;
calculating similarity scores between the extracted data of the same category and the N history records, sorting the similarity scores in descending order, and selecting the data items corresponding to the top M similarity scores;
respectively calculating the positive feedback score and the negative feedback score of each of the M items by using the cluster center points of the positive feedback and the negative feedback;
calculating the total score of each of the M items from its similarity score, positive feedback score and negative feedback score, sorting the total scores in descending order, and selecting the first K items of data;
for each of the K items of data, finding the corresponding data from the text database or the image database by using the cross-modal retrieval model;
and combining the first K items of data with the K corresponding items from the text database or the image database to form K image-text pairs, namely the recommendation result.
The cross-modal retrieval model is used for extracting data features, and its training is divided into two stages:
in the first stage, for images, the intra-modality representation of the image and the inter-modality representation of the image carrying text information are extracted; for texts, the intra-modality representation of the text and the inter-modality representation of the text carrying image information are extracted;
in the second stage, the intra-modality and inter-modality representations of the image are combined to form a comprehensive representation of the image; at the same time, the intra-modality and inter-modality representations of the text are combined to form a comprehensive representation of the text; a stacked corresponding autoencoder and a constraint function are then used to establish the relation between the comprehensive representations of the image and the text and to learn the final representations of the image and the text.
The cross-modal retrieval model is obtained by training through the following process:
image features are first extracted with a MobileNetV3-large model whose last classification layer has been removed; on the basis of this preliminary image feature extraction, on the one hand an AE (autoencoder) is used to extract the representation within the image modality, namely the intra-modality representation of the image; on the other hand an RBM is used to obtain a further representation of the image, which will be used to form the inter-modality representation of the image with text information;
text features are preliminarily extracted with the TF-IDF algorithm; on the basis of this preliminary text feature extraction, on the one hand a DAE is used to extract the representation within the text modality, namely the intra-modality representation of the text; on the other hand an RSRBM is used to obtain a further representation of the text, which will be used to form the inter-modality representation of the text with image information;
based on the further representations of the image and the text, the invention extracts the inter-modality representations of the image and the text with a Multimodal DBN; alternating Gibbs sampling between the image and text representations is performed at the top layer of the Multimodal DBN, yielding an inter-modality representation of the image with text characteristics and an inter-modality representation of the text with image characteristics;
the intra-modality representation and the inter-modality representation of each modality are fused using two joint-RBM models:
one joint-RBM model fuses the intra-modality and inter-modality representations of the image to obtain the comprehensive representation of the image; the other joint-RBM model fuses the intra-modality and inter-modality representations of the text to obtain the comprehensive representation of the text;
two DAEs are used to perform classification training on the comprehensive representation of the image and the comprehensive representation of the text respectively, so as to determine the optimal number of hidden layers for the image and text features;
the extracted optimal hidden layers of the image and the text are fixed and aligned one by one to form a stacked corresponding autoencoder;
in the stacked corresponding autoencoder, an association constraint function is applied and the comprehensive representations of the image and the text from the second stage are reused to train the stacked corresponding autoencoder, so that it establishes the relation between the image and text representations while obtaining the final representations of the image and the text.
When the Multimodal DBN is used to extract the inter-modality representations of the image and the text, the preliminary representation of the text is first input into an RSRBM model, whose energy function is:
E(v, h) = -\sum_i \sum_j v_i w_{ij} h_j - \sum_i b_i v_i - m \sum_j a_j h_j
where v_i is the value of the i-th node of the input layer, h_j is the value of the j-th node of the hidden layer, w_{ij} is the weight between the input layer and the hidden layer, b_i is the bias of the i-th node of the input layer, a_j is the bias of the j-th node of the hidden layer, and m is the sum of the discrete values of the visible layer.
The output of the RSRBM model is used as the text input of the Multimodal DBN and is processed through two hidden layers with 2048 and 1024 nodes respectively, using the sigmoid activation function. Then, at the joint layer of the Multimodal DBN, alternating Gibbs sampling is performed using the following formulas, obtaining representations carrying inter-modality information:
P(h_t \mid h_i^{(1)}) = \sigma(W_i^{(2)} h_i^{(1)} + a_t)
P(h_i \mid h_t^{(2)}) = \sigma(W_t^{(2)} h_t^{(2)} + a_i)
\sigma(x) = 1/(1 + e^{-x})
The first two conditionals generate a distribution over the data of each modality; h_i^{(1)} is the layer-1 hidden layer of the image input, \sigma(\cdot) is the sigmoid activation function, W_i^{(2)} is the weight on layer 2 of the image, a_t is the bias of the last text layer, h_t^{(2)} is the layer-2 hidden layer of the text input, W_t^{(2)} is the weight on the layer-2 hidden layer of the text, a_i is the bias of the last image layer, x is the input to the activation function, and e is the natural constant.
The association constraint function is:
L_{cor}^{(j)}(p_i, q_i) = \| h^{(j)}(p_i; W_p) - h^{(j)}(q_i; W_q) \|^2
where p_i and q_i are the image and text inputs, W_p and W_q are the parameters of the image and text networks, and h^{(j)}(p_i) and h^{(j)}(q_i) are the hidden-layer representations of the image and the text. The loss function of the j-th layer in the stacked corresponding autoencoder is:
L^{(j)}(\theta) = L_I^{(j)}(\theta) + L_T^{(j)}(\theta) + L_{cor}^{(j)}(\theta)
where:
L_I^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - \hat{h}_I^{(j)}(p_i) \|^2
L_T^{(j)}(\theta) = \sum_i \| h_T^{(j)}(q_i) - \hat{h}_T^{(j)}(q_i) \|^2
L_{cor}^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - h_T^{(j)}(q_i) \|^2
L_I^{(j)} and L_T^{(j)} are the reconstruction errors of the image and text autoencoders, L_{cor}^{(j)} is the association constraint error between the image and the text, h_I^{(j)}(p_i) is the representation in the j-th hidden layer of the image in the stacked autoencoder, \hat{h}_I^{(j)}(p_i) is the representation in the j-th reconstruction layer of the image, h_T^{(j)}(q_i) is the representation in the j-th hidden layer of the text, \hat{h}_T^{(j)}(q_i) is the representation in the j-th reconstruction layer of the text, and \theta denotes all parameters of the j-th layer of the stacked autoencoder.
The objective function for the overall adjustment of the stacked corresponding autoencoder is:
J = \sum_i ( \| x_0^{(i)} - x_{2h}^{(i)} \|^2 + \| y_0^{(i)} - y_{2h}^{(i)} \|^2 ) + \delta(q)
where x_0 and y_0 are the input feature vectors of the image and the text, x_{2h} and y_{2h} are their corresponding reconstructed feature representations, and \delta(q) is the L_2 regularization term over all parameters of the stacked corresponding autoencoder.
The center points of the positive feedback and negative feedback clusters in the user's history records are calculated with the K-means method, as follows:
acquiring the user's history records, which comprise positive feedback and negative feedback records;
extracting the feature representations of the positive feedback and negative feedback data using the MMDNN model;
calculating the Euclidean distances between the feature representations within the positive feedback data and within the negative feedback data respectively;
and obtaining the center points of the positive feedback and negative feedback clusters in the user's records by applying the K-means method to each set respectively.
When the positive feedback score and the negative feedback score of each of the M items are calculated using the cluster center points of the positive and negative feedback, the distances between the feature of the candidate picture or text data and the user's positive feedback center and negative feedback center are computed, and the sum of the reciprocals of the distances from the image or text feature to the positive and negative feedback cluster centers is used as the positive-and-negative feedback score of the data.
When the total score of each of the M items is calculated, the similarity score, the positive feedback score and the negative feedback score are combined in a weighted manner to give the total score of the image or text data; the weighting formula is:
S_i = \alpha S_i^{sim} + (1 - \alpha) S_i^{pn}
where S_i is the total score of the image or text, S_i^{sim} is its similarity score, S_i^{pn} is its positive-and-negative feedback score, \alpha is the weight of the similarity score, and i denotes the i-th candidate picture.
A computer device comprises a processor and a memory, wherein the memory stores a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when doing so can implement the multi-modal image-text recommendation method of the invention.
A computer-readable storage medium stores a computer program which, when executed by a processor, can implement the multi-modal image-text recommendation method of the invention.
Compared with the prior art, the invention has at least the following beneficial effects:
Aiming at users' need for combined multi-modal information, the invention provides a cross-modal recommendation method that can recommend combined image and text information resources of likely interest to the user according to the user's preferences; it is the first recommendation method that recommends combined multi-modal information to the user. To realize this function, the invention adopts a cross-modal retrieval model that, compared with traditional cross-modal retrieval models, trains faster and retrieves more accurately. The invention also considers the user's positive and negative feedback on the resources recommended by the system, combines this feedback information with the cross-modal retrieval model, and recommends combined multi-modal information resources according to the user's interests, thereby realizing multi-modal resource recommendation across the different forms of pictures and texts. Furthermore, the invention updates the positive and negative feedback cluster centers in a timely manner according to the user's real-time feedback, so that the recommended content changes as the user's interests change. The invention improves the quality and efficiency of cross-modal retrieval, applies cross-modal retrieval to the field of recommendation systems, does not depend on excessive historical data, and effectively improves the efficiency and accuracy of recommendation.
Furthermore, the MobileNetV3-large model, which offers high accuracy and high speed in image feature extraction, is used to extract image features; the TF-IDF algorithm, which is widely used in text classification and can weight different words in a text, is used for the preliminary text features; and the DAE can efficiently extract the linear and non-linear relationships in the data for the representation within the text modality.
Drawings
FIG. 1 is a schematic diagram of a recommendation method that may be implemented according to the present invention.
Fig. 2 is a schematic diagram of a novel cross-modal search method.
Detailed Description
The technical scheme of the invention is further explained by combining the drawings and the examples.
A cross-modal image-text retrieval model MMDNN (Multimodal Deep Neural Network) is used within a recommendation system. A positive and negative feedback cluster center calculation module PNFCCCM (Positive and Negative Feedback Cluster Center Calculation Module) and the user's positive and negative feedback history are used to compute the user's positive and negative feedback cluster centers. Combining the similarity scores and the positive and negative feedback scores of the data, the items with the highest comprehensive scores relative to the user's history are found in the database, and the MMDNN model then finds the corresponding data of the other modality in the database. Finally, the paired image-text resources are recommended to the user, and the user's history and positive and negative feedback cluster centers are updated according to the user's feedback, realizing multi-modal image-text recommendation.
Referring to fig. 1, a deep learning based multi-modal image-text recommendation method comprises the following steps:
referring to fig. 2, step 1, the cross modal search model MMDNN is trained using the Wikipedia dataset.
Step 1.1, extracting the characteristics of the pictures by using the MobileNet V3, and for the image characteristic extraction, preprocessing the images, converting black and white pictures into three-channel color pictures, and unifying the left and right pictures into a size of 224 × 224 to obtain an image set.
And dividing the image set into a training set and a testing set, sending the training set and the testing set into a MobileNet 3-large model to execute a classification task, and stopping training when the accuracy of the testing set reaches the highest.
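As an illustration of this step, the following minimal PyTorch/torchvision sketch (torchvision ≥ 0.13 assumed) performs the preprocessing and extracts the 1280-dimensional feature by dropping the final classification layer of MobileNetV3-large; the normalization constants and the helper function name are choices made for this example, not specified by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Preprocessing as described: force 3-channel RGB and resize to 224 x 224.
preprocess = transforms.Compose([
    transforms.Lambda(lambda im: im.convert("RGB")),   # black-and-white -> 3 channels
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# MobileNetV3-large with the last classification layer removed:
# the classifier's final Linear(1280 -> 1000) is dropped, leaving a 1280-dim output.
backbone = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.DEFAULT)
backbone.classifier = nn.Sequential(*list(backbone.classifier.children())[:-1])
backbone.eval()

def extract_image_feature(path: str) -> torch.Tensor:
    """Return the 1280-dimensional preliminary image representation I_m."""
    x = preprocess(Image.open(path)).unsqueeze(0)      # shape (1, 3, 224, 224)
    with torch.no_grad():
        return backbone(x).squeeze(0)                  # shape (1280,)
```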
The MobileNetV3-large model with the last classification layer removed extracts the picture features, giving a 1280-dimensional preliminary image feature representation I_m. Because the top layer of the Multimodal DBN module is 1024-dimensional, the 1280-dimensional preliminary image feature I_m is reduced through one AE (AutoEncoder) layer so that the output is 1024-dimensional. Its functions are:
h = f(x) = l_f(w_f x + b_h)
r = g(h) = l_g(w_g h + b_r)
L(x, r) = ||r - x||^2
where h denotes the hidden layer, r the reconstruction layer, f(x) and g(h) the activation functions (sigmoid activation functions are used in the invention), w_f and w_g the weights, b_h and b_r the biases, and L(x, r) the reconstruction error function. The AE is trained by minimizing this error function, and its final 1024-dimensional output is the image representation I_a.
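A minimal PyTorch sketch of such a single-layer AE, assuming the 1280-dimensional input and 1024-dimensional hidden layer described above (the optimizer, learning rate, batch of random features and training-loop length are illustrative assumptions), could look like this:

```python
import torch
import torch.nn as nn

class SimpleAE(nn.Module):
    """One-layer autoencoder: h = sigmoid(W_f x + b_h), r = sigmoid(W_g h + b_r)."""
    def __init__(self, in_dim: int = 1280, hid_dim: int = 1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hid_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        return h, self.decoder(h)

ae = SimpleAE()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
batch = torch.rand(32, 1280)                      # stands in for a batch of I_m features

for _ in range(10):                               # illustrative training loop
    h, r = ae(batch)
    loss = ((r - batch) ** 2).sum(dim=1).mean()   # reconstruction error L(x, r) = ||r - x||^2
    opt.zero_grad()
    loss.backward()
    opt.step()

I_a = ae.encoder(batch)                           # 1024-dimensional representation I_a
```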
Step 1.2, obtaining the preliminary text representation with the TF-IDF algorithm, specifically as follows:
first, stop words in the text are removed with the NLTK tool; the TF-IDF value of each word in each text is then computed with the TF-IDF algorithm, the words in each document are sorted in descending order of TF-IDF value, and the TF-IDF values of the first 3000 words are selected as the preliminary representation of the text. The words of all documents are counted and each document is encoded according to a unified word order: the 3000 words of each document are represented by their TF-IDF values at the positions corresponding to the total vocabulary, and the remaining positions are filled with 0.
The dimensionality of the preliminary text representation extracted by TF-IDF is still too large, so it is reduced to 3000 dimensions with the PCA algorithm; the reduced text representation is denoted T_p.
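A sketch of this text preprocessing with NLTK and scikit-learn could look as follows; the two-document corpus is a placeholder, and reducing to 3000 components assumes the corpus has at least that many documents and vocabulary terms:

```python
# Requires: pip install nltk scikit-learn, plus nltk.download("stopwords") once.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

corpus = [
    "first placeholder document about mountains and rivers",
    "second placeholder document about city photographs",
]

# TF-IDF over the whole vocabulary, with English stop words (NLTK list) removed.
vectorizer = TfidfVectorizer(stop_words=stopwords.words("english"))
tfidf = vectorizer.fit_transform(corpus).toarray()        # one row per document

# Reduce the vocabulary-sized vectors with PCA to obtain the preliminary
# text representation T_p (3000 dimensions in the patent's setting).
n_components = min(3000, tfidf.shape[0], tfidf.shape[1])  # PCA cannot exceed either dimension
T_p = PCA(n_components=n_components).fit_transform(tfidf)
```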
Step 1.3, extracting the representation within the text modality with a DAE, specifically:
the input of the DAE is the preliminary representation T_p after PCA dimension reduction; its hidden layers are set to 2 layers with 2048 and 1024 dimensions respectively.
As an example, the DAE is trained by minimizing the following objective function:
J_{DAE} = L_r(x, x_{2h}) + \lambda \sum_{p=1}^{h} ( \| w_e^{(p)} \|^2 + \| w_d^{(p)} \|^2 )
where L_r(x, x_{2h}) is the reconstruction error, w_e and w_d are the weights of the encoder and the decoder, p indexes the p-th hidden layer, h is the number of hidden layers, and the second term is the L_2 regularization. The final output is the representation T_d within the text modality.
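As a sketch, a denoising autoencoder matching the 2048/1024 hidden-layer layout, with the L2 penalty applied through weight decay, might be written as below; the noise level, weight-decay value and training loop are illustrative assumptions, not values given by the patent:

```python
import torch
import torch.nn as nn

class TextDAE(nn.Module):
    """Denoising autoencoder with hidden layers of 2048 and 1024 units."""
    def __init__(self, in_dim: int = 3000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.Sigmoid(),
            nn.Linear(2048, 1024), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(1024, 2048), nn.Sigmoid(),
            nn.Linear(2048, in_dim),
        )

    def forward(self, x):
        noisy = x + 0.1 * torch.randn_like(x)        # corrupt the input (denoising)
        T_d = self.encoder(noisy)
        return T_d, self.decoder(T_d)

dae = TextDAE()
# weight_decay adds the L2 penalty on encoder/decoder weights from the objective.
opt = torch.optim.Adam(dae.parameters(), lr=1e-3, weight_decay=1e-4)
T_p = torch.rand(32, 3000)                           # stands in for preliminary text features

for _ in range(10):                                  # illustrative training loop
    T_d, recon = dae(T_p)
    loss = ((recon - T_p) ** 2).sum(dim=1).mean()    # reconstruction error L_r(x, x_2h)
    opt.zero_grad()
    loss.backward()
    opt.step()
```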
Step 1.4, extracting the inter-modality representations of the image and the text with a Multimodal DBN (Deep Belief Network). The Multimodal DBN model has two inputs, an image representation and a text representation. The image input is the 1280-dimensional preliminary image feature I_m described in step 1.1, passed through one RBM layer to reduce its dimension to 1024; the activation function used in this dimension reduction is the sigmoid function.
The text input is the reduced text representation T_p, which is fed into an RSRBM model. RSRBMs are often used to process discrete-valued data, and their energy function is:
E(v, h) = -\sum_i \sum_j v_i w_{ij} h_j - \sum_i b_i v_i - m \sum_j a_j h_j
where v_i is the value of the i-th node of the input layer, h_j is the value of the j-th node of the hidden layer, w_{ij} is the weight between the input layer and the hidden layer, b_i is the bias of the i-th node of the input layer, a_j is the bias of the j-th node of the hidden layer, and m is the sum of the discrete values of the visible layer.
The output of the RSRBM model is used as the text input of the Multimodal DBN and is processed through two hidden layers with 2048 and 1024 nodes respectively, using the sigmoid activation function. Then, at the joint layer of the Multimodal DBN, alternating Gibbs sampling is performed using the following formulas:
P(h_t \mid h_i^{(1)}) = \sigma(W_i^{(2)} h_i^{(1)} + a_t)
P(h_i \mid h_t^{(2)}) = \sigma(W_t^{(2)} h_t^{(2)} + a_i)
\sigma(x) = 1/(1 + e^{-x})
The first two conditionals generate a distribution over the data of each modality; h_i^{(1)} is the layer-1 hidden layer of the image input, \sigma(\cdot) is the sigmoid activation function, W_i^{(2)} is the weight on layer 2 of the image, a_t is the bias of the last text layer, h_t^{(2)} is the layer-2 hidden layer of the text input, W_t^{(2)} is the weight on the layer-2 hidden layer of the text, a_i is the bias of the last image layer, x is the input to the activation function, and e is the natural constant. The Multimodal DBN produces two final outputs: an inter-modality representation of the image with text characteristics, denoted Y_i, and an inter-modality representation of the text with image characteristics, denoted Y_t.
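A small numpy sketch of the alternating Gibbs sampling between the two pathways at the joint layer, using sigmoid conditionals of the form given above, is shown below; all weights, biases and inputs are random stand-ins (a real model would use RBM-trained parameters), and taking the final mean activations as Y_i and Y_t is an assumption of this example:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

d_img, d_txt = 1024, 1024                   # top-layer sizes of the image and text pathways
W_i = rng.normal(scale=0.01, size=(d_img, d_txt))   # image-side weights W_i^(2)
W_t = rng.normal(scale=0.01, size=(d_txt, d_img))   # text-side weights W_t^(2)
a_t = np.zeros(d_txt)                       # bias of the last text layer
a_i = np.zeros(d_img)                       # bias of the last image layer

h_img = rng.random(d_img)                   # top hidden activity of the image pathway
h_txt = rng.random(d_txt)                   # top hidden activity of the text pathway

for _ in range(50):                         # alternating Gibbs sampling steps
    p_txt = sigmoid(h_img @ W_i + a_t)      # P(h_t | h_i): text given image
    h_txt = (rng.random(d_txt) < p_txt).astype(float)
    p_img = sigmoid(h_txt @ W_t + a_i)      # P(h_i | h_t): image given text
    h_img = (rng.random(d_img) < p_img).astype(float)

Y_i, Y_t = p_img, p_txt                     # mean activations kept as inter-modality representations
```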
Step 1.5, the intra-modality representation and the inter-modality representation of each modality are fused by two joint-RBMs.
The inputs of the image joint-RBM are I_a and Y_i and its output is I_0; the inputs of the text joint-RBM are T_d and Y_t and its output is T_0. The image joint-RBM model fuses the intra-modality and inter-modality representations of the image to obtain the comprehensive image representation I_0; the text joint-RBM model fuses the intra-modality and inter-modality representations of the text to obtain the comprehensive text representation T_0.
Step 1.6, two DAEs are used to perform classification training on I_0 and T_0 respectively. As an embodiment, the invention determines the optimal number of hidden layers to be 3, with 512, 256 and 64 nodes respectively.
And step 1.7, corresponding the hidden layers of the image and the hidden layers of the text one by one to form a new self-encoder, namely a stack type corresponding self-encoder.
Step 1.8, applying an association constraint function in the stack-type corresponding self-encoder, training the stack-type corresponding self-encoder layer by layer from bottom to top by minimizing an objective function, and then adjusting all self-encoders on the whole, so that the stack-type corresponding self-encoder can establish a relation between the representations of the images and the texts while obtaining the final representations of the images and the texts.
Taking the association constraint function of the j-th layer as an example:
L_{cor}^{(j)}(p_i, q_i) = \| h^{(j)}(p_i; W_p) - h^{(j)}(q_i; W_q) \|^2
where p_i and q_i are the image and text inputs, W_p and W_q are the parameters of the image and text networks, and h^{(j)}(p_i) and h^{(j)}(q_i) are the hidden-layer representations of the image and the text. The loss function of the j-th layer of the SCAE can then be expressed as:
L^{(j)}(\theta) = L_I^{(j)}(\theta) + L_T^{(j)}(\theta) + L_{cor}^{(j)}(\theta)
with:
L_I^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - \hat{h}_I^{(j)}(p_i) \|^2
L_T^{(j)}(\theta) = \sum_i \| h_T^{(j)}(q_i) - \hat{h}_T^{(j)}(q_i) \|^2
where L_I^{(j)} and L_T^{(j)} are the reconstruction errors of the image and text autoencoders, computed from the j-th hidden-layer representations and their reconstruction-layer counterparts, and L_{cor}^{(j)} is the association constraint error between the image and the text.
The objective function at the overall adjustment stage is:
J = \sum_i ( \| x_0^{(i)} - x_{2h}^{(i)} \|^2 + \| y_0^{(i)} - y_{2h}^{(i)} \|^2 ) + \delta(q)
where x_0 and y_0 are the input feature vectors of the image and the text, x_{2h} and y_{2h} are their corresponding reconstructed feature representations, and \delta(q) is the L_2 regularization term over all parameters of the stacked corresponding autoencoder.
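A PyTorch sketch of one layer of such a correspondence autoencoder, combining the two reconstruction errors with the association constraint, is given below; the layer dimensions, the weight lam on the constraint and the random training batch are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CorrespondenceAELayer(nn.Module):
    """One SCAE layer: parallel image/text autoencoders tied by a correlation constraint."""
    def __init__(self, img_dim=1024, txt_dim=1024, hid_dim=512):
        super().__init__()
        self.enc_img = nn.Sequential(nn.Linear(img_dim, hid_dim), nn.Sigmoid())
        self.dec_img = nn.Linear(hid_dim, img_dim)
        self.enc_txt = nn.Sequential(nn.Linear(txt_dim, hid_dim), nn.Sigmoid())
        self.dec_txt = nn.Linear(hid_dim, txt_dim)

    def forward(self, p, q):
        h_p, h_q = self.enc_img(p), self.enc_txt(q)
        return h_p, h_q, self.dec_img(h_p), self.dec_txt(h_q)

def scae_layer_loss(p, q, h_p, h_q, r_p, r_q, lam=1.0):
    rec_img = ((r_p - p) ** 2).sum(dim=1).mean()     # image reconstruction error
    rec_txt = ((r_q - q) ** 2).sum(dim=1).mean()     # text reconstruction error
    corr = ((h_p - h_q) ** 2).sum(dim=1).mean()      # association constraint on hidden codes
    return rec_img + rec_txt + lam * corr

layer = CorrespondenceAELayer()
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
I_0, T_0 = torch.rand(32, 1024), torch.rand(32, 1024)   # stand-ins for the fused representations

for _ in range(10):                                  # illustrative layer-wise training
    h_p, h_q, r_p, r_q = layer(I_0, T_0)
    loss = scae_layer_loss(I_0, T_0, h_p, h_q, r_p, r_q)
    opt.zero_grad()
    loss.backward()
    opt.step()
```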
Step 2: based on the user's history records, the trained cross-modal retrieval model MMDNN is used to calculate the center points of the user's positive and negative feedback clusters.
Step 2.1, the user's history records are first obtained, comprising the user's name, browsing records and scores; the full score is 5, a score of 3 or above is regarded as positive feedback, and a score below 3 is regarded as negative feedback. Users at different stages emphasize image or text resources differently: if the user focuses more on pictures, image-text resources are recommended according to the user's picture records; if the user focuses more on text resources, image-text resources are recommended according to the text records browsed by the user.
As an example, the following is a process of recommending the image-text resource to the user according to the picture record of the user, and the process of recommending the image-text resource to the user according to the text record of the user is similar to this.
Step 2.2, the user's interests change over time, so the positive and negative feedback cluster center points are calculated from the user's most recent 50 picture records; if there are fewer than 50, all of them are used.
The 64-dimensional final representations of the user's 50 pictures are obtained with MMDNN, and the distance between any two pictures is computed with the Euclidean distance:
d(I_{f1}, I_{f2}) = \sqrt{ \sum_{k=1}^{64} (I_{f1,k} - I_{f2,k})^2 }
where I_{f1} and I_{f2} are the final representations of the two pictures.
Then the positive and negative feedback center points of the user's records are calculated separately with the K-means algorithm, where K is set to 1.
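With K = 1, the K-means center of each group reduces to the mean of its feature vectors; the following scikit-learn sketch illustrates this, with random arrays standing in for the 64-dimensional MMDNN representations:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pos_features = rng.random((30, 64))    # final 64-dim representations of positive-feedback pictures
neg_features = rng.random((20, 64))    # final 64-dim representations of negative-feedback pictures

# K-means with k = 1 on each group; with a single cluster the center equals the mean.
pos_center = KMeans(n_clusters=1, n_init=10).fit(pos_features).cluster_centers_[0]
neg_center = KMeans(n_clusters=1, n_init=10).fit(neg_features).cluster_centers_[0]

assert np.allclose(pos_center, pos_features.mean(axis=0))
```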
Step 3, as an example, the first 20 records are selected from the user's positive feedback records; if there are fewer than 20, all of them are taken. Recommendations to the user are made on this basis.
Step 4, the final representations of the 20 pictures (I_{f1}, I_{f2}, I_{f3}, ..., I_{f20}) are extracted with the MMDNN model, and the category to which these 20 final representations belong is determined. For convenience of description, it is assumed here that the 20 final representations belong to the category "landscape".
Step 5, all pictures in the "landscape" sub-database of the picture database (excluding the pictures already browsed by the user) are processed by MMDNN to obtain their final representations (I_{r1}, I_{r2}, I_{r3}, I_{r4}, I_{r5}, ...).
Step 6, the similarity score between the final representation of each picture obtained in step 5 and the user's 20 records is calculated. As an example implementation, cosine similarity is used, and the similarity score is:
S_{ri}^{sim} = \sum_{j=1}^{20} sim(I_{ri}, I_{fj})
where S_{ri}^{sim} is the similarity score of picture I_{ri} and sim(I_{ri}, I_{fj}) is the similarity between pictures I_{ri} and I_{fj}, computed as:
sim(A, B) = \frac{ \sum_{k=1}^{n} A_k B_k }{ \sqrt{ \sum_{k=1}^{n} A_k^2 } \sqrt{ \sum_{k=1}^{n} B_k^2 } }
where A and B are two n-dimensional vectors, A_k and B_k are the k-th features of vectors A and B, A is picture I_{ri} and B is picture I_{fj}. After the similarity scores are calculated, they are sorted in descending order and the top M pictures are taken as candidates. As an example, M is set to 10.
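A numpy sketch of this scoring step, summing the cosine similarity of each candidate against the 20 user records and keeping the top M = 10, could look as follows; the random vectors stand in for the MMDNN final representations:

```python
import numpy as np

rng = np.random.default_rng(1)
user_records = rng.random((20, 64))      # final representations I_f1 ... I_f20
candidates = rng.random((500, 64))       # final representations of the "landscape" pictures

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity score of each candidate = combined cosine similarity to the 20 records.
sim_scores = np.array([
    sum(cosine(c, r) for r in user_records) for c in candidates
])

M = 10
top_m = np.argsort(sim_scores)[::-1][:M]  # indices of the 10 candidate pictures kept
```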
Step 7, the distance between each of the 10 candidate pictures and the user's positive and negative feedback cluster centers is calculated, and the sum of the reciprocals of the distances from each picture to the positive feedback center and the negative feedback center is used as the positive-and-negative feedback score of that picture:
S_i^{pn} = \frac{1}{d(I_{ri}, X)} + \frac{1}{d(I_{ri}, Y)}
where I_{ri} denotes the i-th candidate picture, X is the positive feedback cluster center point, Y is the negative feedback cluster center point, and S_i^{pn} is the positive-and-negative feedback score of the i-th candidate picture.
Step 8, the total score of each candidate picture is calculated by combining its similarity score and its positive-and-negative feedback score; the total scores are sorted in descending order and the first K pictures are taken as the final candidate recommended pictures. As an example, K is set to 5, and the similarity score and the positive-and-negative feedback score are combined in a weighted manner:
S_i = \alpha S_i^{sim} + (1 - \alpha) S_i^{pn}
where S_i is the final score of the i-th picture, \alpha is the weight of the picture similarity score, and i denotes the i-th candidate picture.
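Continuing the sketch above, the positive-and-negative feedback score and the weighted total score of each of the 10 candidates could be computed as below; alpha = 0.5, the epsilon guard against division by zero, and the random stand-in data are assumptions of this example, and the (1 - alpha) weighting follows the formula as reconstructed above:

```python
import numpy as np

def feedback_score(feature, pos_center, neg_center, eps=1e-8):
    """Sum of reciprocal Euclidean distances to the positive and negative centers."""
    d_pos = np.linalg.norm(feature - pos_center)
    d_neg = np.linalg.norm(feature - neg_center)
    return 1.0 / (d_pos + eps) + 1.0 / (d_neg + eps)

def total_score(sim_score, pn_score, alpha=0.5):
    """Weighted combination S_i = alpha * similarity + (1 - alpha) * feedback score."""
    return alpha * sim_score + (1 - alpha) * pn_score

rng = np.random.default_rng(2)
pos_center, neg_center = rng.random(64), rng.random(64)
candidate_features = rng.random((10, 64))          # the M = 10 candidates kept earlier
sim_scores = rng.random(10)                        # their similarity scores

totals = np.array([
    total_score(s, feedback_score(f, pos_center, neg_center))
    for f, s in zip(candidate_features, sim_scores)
])
K = 5
top_k = np.argsort(totals)[::-1][:K]               # final candidate recommended pictures
```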
Step 9, for each picture in the obtained alternative recommended pictures, finding a text resource corresponding to the picture from a text database by using an MMDNN model;
and step 10, combining the obtained alternative recommended pictures and the corresponding texts thereof to form 5 pairs of picture-text resources, namely forming a result recommended to the user.
The invention also updates the history record and the positive and negative feedback central point of the user according to the feedback of the user.
As an alternative embodiment, the invention further provides a computer device comprising a processor and a memory, wherein the memory stores a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when doing so can implement part or all of the steps of the multi-modal image-text recommendation method of the invention; the memory is further used to store the user's history records.
A computer-readable storage medium stores a computer program which, when executed by a processor, can implement the multi-modal image-text recommendation method of the invention.
The computer equipment can adopt a notebook computer, a tablet computer, a desktop computer, a mobile phone or a workstation.
The invention also provides an output device for outputting the prediction result, wherein the output device is linked with the output end of the processor, and the output device is a display or a printer.
The processor of the present invention may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC) or a ready-made programmable gate array (FPGA).
The memory of the present invention may be an internal storage unit of a notebook computer, tablet computer, desktop computer, mobile phone or workstation, such as a memory or a hard disk; external storage units such as removable hard disks or flash memory cards may also be used.
Computer-readable storage media may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. The computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), solid-state drive (SSD) or optical disc. The random access memory may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM).

Claims (8)

1. A multi-modal image-text recommendation method based on deep learning, characterized by comprising the following steps:
calculating a clustering center point of positive feedback and negative feedback of a user by adopting a cross-modal retrieval model based on a historical record of the user, wherein the historical record comprises an image and a text;
selecting the first N historical records with higher user scores from the user historical records;
extracting the characteristics of the N historical records, and obtaining the categories of the N historical records according to the characteristics;
extracting the data of the same type from the database with the same history recording modes by using a cross-mode retrieval model;
calculating similarity scores of the extracted data of the same type and the N historical records, arranging the similarity scores according to a reverse order, and selecting the historical records corresponding to the previous M similarity scores;
respectively calculating the positive feedback score and the negative feedback score of each history record in the M items by using the clustering center points of the positive feedback and the negative feedback;
calculating the total score of each data in the M historical records according to the similarity score of each of the M similarity scores, the positive feedback score and the negative feedback score, arranging the total scores in a reverse order, and selecting the first K data;
for each item of data in the K data, finding out K data corresponding to the item of data from a text database or an image database by using a cross-modal retrieval model;
correspondingly combining the first K data with K data in a text database or an image database to form K image-text pairs, namely obtaining a recommendation result; the cross-modal retrieval model is obtained by training through the following processes:
the method comprises the steps of firstly extracting image features by adopting a MobileNetV3-large model with the last classification layer removed; on the basis of the preliminary image feature extraction, on the one hand an AE (autoencoder) is used to extract the representation within the image modality, namely the intra-modality representation of the image; on the other hand an RBM is used to obtain a further representation of the image, which is to be used to form the inter-modality representation of the image with text information;
preliminarily extracting text features by using a TF-IDF algorithm; on the one hand, on the basis of the preliminary extraction of text features, DAE is used for extracting representations in a text mode, namely the representations in the text mode with intra-mode information; using RSRBM extraction on the one hand to derive a text further representation to be used to form a text inter-modality representation with image information;
based on the further representation of the image and the further representation of the text, the invention extracts an inter-modal representation of the image and the text with a Multimodal DBN; performing Gibbs sampling alternately between the image and the text representation at the top layer of the Multimodal DBN, namely obtaining an inter-image modality representation with text characteristics and an inter-text modality representation with image characteristics;
the intra-modality representation and the inter-modality representation of each modality are fused using two joint-RBM models,
one joint-RBM model fuses the intra-modality and inter-modality representations of the image to obtain the comprehensive representation of the image; the other joint-RBM model fuses the intra-modality and inter-modality representations of the text to obtain the comprehensive representation of the text;
respectively carrying out classification training on the comprehensive representation of the image and the comprehensive representation of the text by using two DAEs so as to extract the optimal hidden layer number of the image and text characteristics;
fixing the optimal hidden layers of the extracted images and texts, and aligning the optimal hidden layers of the images and the texts one by one to form a stack type corresponding self-encoder;
in the stack type corresponding self-encoder, an association constraint function is used, the comprehensive representation of the second-stage image and the comprehensive representation of the second-stage text are reused to train the stack type corresponding self-encoder, so that the stack type corresponding self-encoder can establish a relation between the representations of the image and the text while obtaining the final representation of the image and the text;
when the clustering center points of the positive feedback and the negative feedback are used for respectively calculating the positive feedback score and the negative feedback score of each historical record in the M items: and calculating the distance between the alternative picture or text data characteristic and the positive feedback center and the negative feedback center of the user, and using the sum of the reciprocal of the distance between the image or text data characteristic and the positive and negative feedback cluster center point as the positive and negative feedback score of the data.
2. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein the cross-modal retrieval model is used for data feature extraction, and the training of the cross-modal retrieval model is divided into two stages:
in a first stage, for an image, extracting representations within an image modality and representations between image modalities with text information; for text, extracting representations in a text mode and representations between the text modes with image information;
in a second phase, combining the representation within the image modality and the representation between the image modalities to form a composite representation of the image; meanwhile, the representation in the text mode and the representation between the text modes are combined to form a text comprehensive representation, then a stack type corresponding self-encoder and a constraint function are utilized to establish the relation between the comprehensive representation of the image and the text, and the final representation of the image and the text is learned.
3. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein when the Multimodal DBN is used to extract the inter-modality representations of the image and the text, the preliminary representation of the text is first input into an RSRBM model, whose energy function is:
E(v, h) = -\sum_i \sum_j v_i w_{ij} h_j - \sum_i b_i v_i - m \sum_j a_j h_j
where v_i is the value of the i-th node of the input layer, h_j is the value of the j-th node of the hidden layer, w_{ij} is the weight between the input layer and the hidden layer, b_i is the bias of the i-th node of the input layer, a_j is the bias of the j-th node of the hidden layer, and m is the sum of the discrete values of the visible layer;
the output of the RSRBM model is used as the text input of the Multimodal DBN and is processed through two hidden layers with 2048 and 1024 nodes respectively, using the sigmoid activation function; then, at the joint layer of the Multimodal DBN, alternating Gibbs sampling is performed using the following formulas, obtaining representations carrying inter-modality information:
P(h_t \mid h_i^{(1)}) = \sigma(W_i^{(2)} h_i^{(1)} + a_t)
P(h_i \mid h_t^{(2)}) = \sigma(W_t^{(2)} h_t^{(2)} + a_i)
\sigma(x) = 1/(1 + e^{-x})
where the first two conditionals generate a distribution over the data of each modality, h_i^{(1)} is the layer-1 hidden layer of the image input, \sigma(\cdot) is the sigmoid activation function, W_i^{(2)} is the weight on layer 2 of the image, a_t is the bias of the last text layer, h_t^{(2)} is the layer-2 hidden layer of the text input, W_t^{(2)} is the weight on the layer-2 hidden layer of the text, a_i is the bias of the last image layer, x is the input to the activation function, and e is the natural constant.
4. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein the association constraint function is:
L_{cor}^{(j)}(p_i, q_i) = \| h^{(j)}(p_i; W_p) - h^{(j)}(q_i; W_q) \|^2
where p_i and q_i are the image and text inputs, W_p and W_q are the parameters of the image and text networks, and h^{(j)}(p_i) and h^{(j)}(q_i) are the hidden-layer representations of the image and the text; the loss function of the j-th layer in the stacked corresponding autoencoder is:
L^{(j)}(\theta) = L_I^{(j)}(\theta) + L_T^{(j)}(\theta) + L_{cor}^{(j)}(\theta)
wherein:
L_I^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - \hat{h}_I^{(j)}(p_i) \|^2
L_T^{(j)}(\theta) = \sum_i \| h_T^{(j)}(q_i) - \hat{h}_T^{(j)}(q_i) \|^2
L_{cor}^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - h_T^{(j)}(q_i) \|^2
L_I^{(j)} and L_T^{(j)} are the reconstruction errors of the image and text autoencoders, L_{cor}^{(j)} is the association constraint error between the image and the text, h_I^{(j)}(p_i) is the representation in the j-th hidden layer of the image in the stacked autoencoder, \hat{h}_I^{(j)}(p_i) is the representation in the j-th reconstruction layer of the image, h_T^{(j)}(q_i) is the representation in the j-th hidden layer of the text, \hat{h}_T^{(j)}(q_i) is the representation in the j-th reconstruction layer of the text, and \theta denotes all parameters of the j-th layer of the stacked autoencoder;
the objective function for the overall adjustment of the stacked corresponding autoencoder is:
J = \sum_i ( \| x_0^{(i)} - x_{2h}^{(i)} \|^2 + \| y_0^{(i)} - y_{2h}^{(i)} \|^2 ) + \delta(q)
where x_0 and y_0 are the input feature vectors of the image and the text, x_{2h} and y_{2h} are their corresponding reconstructed feature representations, and \delta(q) is the L_2 regularization term over all parameters of the stacked corresponding autoencoder.
5. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein the center points of the positive feedback and negative feedback clusters in the user's history records are calculated with the K-means method, as follows:
acquiring a history record of a user, wherein the history record comprises a positive feedback record and a negative feedback record;
extracting feature representation of the positive feedback and negative feedback data by using a cross-modal graph-text retrieval model;
respectively calculating the distance between the characteristic representations of the positive feedback data and the negative feedback data by using Euclidean distance;
and respectively calculating by using a K-means method to obtain the central points of positive feedback and negative feedback clustering in the user record.
6. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein when the total score of each of the M items is calculated, the similarity score, the positive feedback score and the negative feedback score are combined in a weighted manner to give the total score of the image or text data; the weighting formula is:
S_i = \alpha S_i^{sim} + (1 - \alpha) S_i^{pn}
where S_i is the total score of the image or text, S_i^{sim} is its similarity score, S_i^{pn} is its positive-and-negative feedback score, \alpha is the weight of the similarity score, and i denotes the i-th candidate picture.
7. A computer device comprising a processor and a memory, wherein the memory is used for storing a computer-executable program, the processor reads part or all of the computer-executable program from the memory and executes it, and when executing part or all of the computer-executable program the processor can realize the multi-modal image-text recommendation method according to any one of claims 1-6.
8. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, is adapted to carry out the multi-modal image-text recommendation method according to any one of claims 1-6.
CN202110385246.5A 2021-04-09 2021-04-09 Multi-mode image-text recommendation method and device based on deep learning Active CN113094534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110385246.5A CN113094534B (en) 2021-04-09 2021-04-09 Multi-mode image-text recommendation method and device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110385246.5A CN113094534B (en) 2021-04-09 2021-04-09 Multi-mode image-text recommendation method and device based on deep learning

Publications (2)

Publication Number Publication Date
CN113094534A CN113094534A (en) 2021-07-09
CN113094534B true CN113094534B (en) 2022-09-02

Family

ID=76676034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110385246.5A Active CN113094534B (en) 2021-04-09 2021-04-09 Multi-mode image-text recommendation method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN113094534B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114462554B (en) * 2022-04-13 2022-07-05 华南理工大学 Potential depression assessment system based on multi-mode width learning
CN114612749B (en) * 2022-04-20 2023-04-07 北京百度网讯科技有限公司 Neural network model training method and device, electronic device and medium
CN115964560B (en) * 2022-12-07 2023-10-27 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462485A (en) * 2014-12-18 2015-03-25 北京邮电大学 Cross-modal retrieval method based on corresponding deep-layer belief network
CN108595636A (en) * 2018-04-25 2018-09-28 复旦大学 The image search method of cartographical sketching based on depth cross-module state correlation study

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10459995B2 (en) * 2016-12-22 2019-10-29 Shutterstock, Inc. Search engine for processing image search queries in multiple languages
KR102387305B1 (en) * 2017-11-17 2022-04-29 삼성전자주식회사 Method and device for learning multimodal data
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 A kind of picture and text associative search method based on binary channels network
CN108876643A (en) * 2018-05-24 2018-11-23 北京工业大学 It is a kind of social activity plan exhibition network on acquire(Pin)Multimodal presentation method
US11074253B2 (en) * 2018-11-02 2021-07-27 International Business Machines Corporation Method and system for supporting inductive reasoning queries over multi-modal data from relational databases
US20200311798A1 (en) * 2019-03-25 2020-10-01 Board Of Trustees Of The University Of Illinois Search engine use of neural network regressor for multi-modal item recommendations based on visual semantic embeddings
US11244205B2 (en) * 2019-03-29 2022-02-08 Microsoft Technology Licensing, Llc Generating multi modal image representation for an image
CN110457516A (en) * 2019-08-12 2019-11-15 桂林电子科技大学 A kind of cross-module state picture and text search method
CN112287166B (en) * 2020-09-23 2023-03-07 山东师范大学 Movie recommendation method and system based on improved deep belief network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462485A (en) * 2014-12-18 2015-03-25 北京邮电大学 Cross-modal retrieval method based on corresponding deep-layer belief network
CN108595636A (en) * 2018-04-25 2018-09-28 复旦大学 The image search method of cartographical sketching based on depth cross-module state correlation study

Also Published As

Publication number Publication date
CN113094534A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN113094534B (en) Multi-mode image-text recommendation method and device based on deep learning
Kaur et al. Comparative analysis on cross-modal information retrieval: A review
Dering et al. A convolutional neural network model for predicting a product's function, given its form
US20170200066A1 (en) Semantic Natural Language Vector Space
CN112241468A (en) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
GB2546360A (en) Image captioning with weak supervision
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
Li et al. MRMR-based ensemble pruning for facial expression recognition
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN108154156B (en) Image set classification method and device based on neural topic model
CN112015868A (en) Question-answering method based on knowledge graph completion
Peng et al. UMass at ImageCLEF Medical Visual Question Answering (Med-VQA) 2018 Task.
CN116204706A (en) Multi-mode content retrieval method and system for text content and image analysis
US11755668B1 (en) Apparatus and method of performance matching
Abdul-Rashid et al. Shrec’18 track: 2d image-based 3d scene retrieval
CN113297410A (en) Image retrieval method and device, computer equipment and storage medium
CN111461175A (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
CN115221369A (en) Visual question-answer implementation method and visual question-answer inspection model-based method
CN112950414B (en) Legal text representation method based on decoupling legal elements
CN112084338B (en) Automatic document classification method, system, computer equipment and storage medium
Bibi et al. Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval
CN116775798A (en) Cross-modal hash method based on feature fusion between graph network and modalities
US11810598B2 (en) Apparatus and method for automated video record generation
CN113297485B (en) Method for generating cross-modal representation vector and cross-modal recommendation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant