CN113094534B - Multi-mode image-text recommendation method and device based on deep learning - Google Patents

Multi-mode image-text recommendation method and device based on deep learning

Info

Publication number
CN113094534B
CN113094534B (application CN202110385246.5A)
Authority
CN
China
Prior art keywords
text
image
representation
layer
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110385246.5A
Other languages
Chinese (zh)
Other versions
CN113094534A (en)
Inventor
黄昭
胡浩武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University
Priority to CN202110385246.5A
Publication of CN113094534A
Application granted
Publication of CN113094534B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 16/535 - Information retrieval of still image data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F 16/335 - Information retrieval of unstructured textual data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F 18/23213 - Pattern recognition; non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/241 - Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 - Pattern recognition; fusion techniques
    • G06N 3/08 - Computing arrangements based on biological models; neural networks; learning methods

Abstract

The invention discloses a multi-modal image-text recommendation method and device based on deep learning. The method uses a cross-modal image-text retrieval model MMDNN within a recommendation system. A positive and negative feedback cluster center calculation module PNFCCCM and the user's positive and negative feedback history are used to compute the user's positive and negative feedback cluster centers. Combining the similarity scores of candidate data with their positive and negative feedback scores, the items with the highest comprehensive score relative to the user's history are selected from the database, and the MMDNN model then retrieves the corresponding data of the other modality from the database. Paired image-text resources are recommended to the user, and the user's history and positive and negative feedback cluster centers are updated according to the user's feedback, thereby realizing multi-modal image-text recommendation.

Description

Multi-mode image-text recommendation method and device based on deep learning
Technical Field
The invention belongs to the field of computer science and technology application, and particularly relates to a deep learning-based multi-modal image-text recommendation method and device.
Background
Currently, most recommendation systems focus on providing single-modality content, such as recommending pictures based on pictures and texts based on texts. In fact, pictures and texts describe the same semantics in unbalanced yet complementary ways: images can usually convey details that text cannot show, while text is better at expressing high-level meaning. Users therefore need combined multi-modal information resources, which makes cross-modal retrieval technology increasingly attractive. Cross-modal retrieval is a technique that, given a query in one modality, returns combined information in multiple modalities. At present, many cross-modal retrieval methods are applied only to retrieval rather than to recommendation systems, and they suffer from drawbacks such as insufficient retrieval precision and long running time. Moreover, most recommendation systems consider only the user's positive feedback, even though the user's negative feedback records also contain much useful information. It is therefore necessary to improve the quality and efficiency of cross-modal retrieval methods.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a multi-mode image-text recommendation method and device based on deep learning, and the purpose of recommending the needed image-text combined information resources to the user according to the personal preference of the user is achieved by designing an efficient cross-modal image-text retrieval method and applying the method to a recommendation system.
In order to achieve the purpose, the invention adopts the technical scheme that the multi-modal image-text recommendation method based on deep learning comprises the following steps:
calculating the cluster center points of positive feedback and negative feedback of a user by adopting a cross-modal retrieval model based on the user's history records, wherein the history records comprise images and texts;
selecting the first N history records with the highest user scores from the user's history records;
extracting features of the N history records, and obtaining the categories of the N history records according to the features;
extracting data of the same category and the same modality as the history records from the database by using the cross-modal retrieval model;
calculating similarity scores between the extracted data of the same category and the N history records, sorting the similarity scores in descending order, and selecting the data items corresponding to the top M similarity scores;
respectively calculating the positive feedback score and the negative feedback score of each of the M items by using the cluster center points of the positive feedback and the negative feedback;
calculating the total score of each of the M items from its similarity score, positive feedback score and negative feedback score, sorting the total scores in descending order, and selecting the first K items of data;
for each of the K items of data, finding the corresponding data from the text database or the image database by using the cross-modal retrieval model;
and combining the first K items of data with the K corresponding items from the text database or the image database to form K image-text pairs, namely the recommendation result.
The cross-modal retrieval model is used for extracting data features, and its training is divided into two stages:
in the first stage, for images, the intra-modality representation of the image and the inter-modality representation of the image carrying text information are extracted; for texts, the intra-modality representation of the text and the inter-modality representation of the text carrying image information are extracted;
in the second stage, the intra-modality and inter-modality representations of the image are combined to form a comprehensive representation of the image; at the same time, the intra-modality and inter-modality representations of the text are combined to form a comprehensive representation of the text; a stacked corresponding autoencoder and a constraint function are then used to establish the relation between the comprehensive representations of the image and the text and to learn the final representations of the image and the text.
The cross-modal retrieval model is obtained by training through the following process:
image features are first extracted with a MobileNetV3-large model whose last classification layer has been removed; on the basis of this preliminary image feature extraction, on the one hand an AE (autoencoder) is used to extract the representation within the image modality, namely the intra-modality representation of the image; on the other hand an RBM is used to obtain a further representation of the image, which will be used to form the inter-modality representation of the image with text information;
text features are preliminarily extracted with the TF-IDF algorithm; on the basis of this preliminary text feature extraction, on the one hand a DAE is used to extract the representation within the text modality, namely the intra-modality representation of the text; on the other hand an RSRBM is used to obtain a further representation of the text, which will be used to form the inter-modality representation of the text with image information;
based on the further representations of the image and the text, the invention extracts the inter-modality representations of the image and the text with a Multimodal DBN; alternating Gibbs sampling between the image and text representations is performed at the top layer of the Multimodal DBN, yielding an inter-modality representation of the image with text characteristics and an inter-modality representation of the text with image characteristics;
the intra-modality representation and the inter-modality representation of each modality are fused using two joint-RBM models:
one joint-RBM model fuses the intra-modality and inter-modality representations of the image to obtain the comprehensive representation of the image; the other joint-RBM model fuses the intra-modality and inter-modality representations of the text to obtain the comprehensive representation of the text;
two DAEs are used to perform classification training on the comprehensive representation of the image and the comprehensive representation of the text respectively, so as to determine the optimal number of hidden layers for the image and text features;
the extracted optimal hidden layers of the image and the text are fixed and aligned one by one to form a stacked corresponding autoencoder;
in the stacked corresponding autoencoder, an association constraint function is applied and the comprehensive representations of the image and the text from the second stage are reused to train the stacked corresponding autoencoder, so that it establishes the relation between the image and text representations while obtaining the final representations of the image and the text.
When the Multimodal DBN is used to extract the inter-modality representations of the image and the text, the preliminary representation of the text is first input into an RSRBM model, whose energy function is:
E(v, h) = -\sum_i \sum_j v_i w_{ij} h_j - \sum_i b_i v_i - m \sum_j a_j h_j
where v_i is the value of the i-th node of the input layer, h_j is the value of the j-th node of the hidden layer, w_{ij} is the weight between the input layer and the hidden layer, b_i is the bias of the i-th node of the input layer, a_j is the bias of the j-th node of the hidden layer, and m is the sum of the discrete values of the visible layer.
The output of the RSRBM model is used as the text input of the Multimodal DBN and is processed through two hidden layers with 2048 and 1024 nodes respectively, using the sigmoid activation function. Then, at the joint layer of the Multimodal DBN, alternating Gibbs sampling is performed using the following formulas, obtaining representations carrying inter-modality information:
P(h_t \mid h_i^{(1)}) = \sigma(W_i^{(2)} h_i^{(1)} + a_t)
P(h_i \mid h_t^{(2)}) = \sigma(W_t^{(2)} h_t^{(2)} + a_i)
\sigma(x) = 1/(1 + e^{-x})
The first two conditionals generate a distribution over the data of each modality; h_i^{(1)} is the layer-1 hidden layer of the image input, \sigma(\cdot) is the sigmoid activation function, W_i^{(2)} is the weight on layer 2 of the image, a_t is the bias of the last text layer, h_t^{(2)} is the layer-2 hidden layer of the text input, W_t^{(2)} is the weight on the layer-2 hidden layer of the text, a_i is the bias of the last image layer, x is the input to the activation function, and e is the natural constant.
The association constraint function is:
L_{cor}^{(j)}(p_i, q_i) = \| h^{(j)}(p_i; W_p) - h^{(j)}(q_i; W_q) \|^2
where p_i and q_i are the image and text inputs, W_p and W_q are the parameters of the image and text networks, and h^{(j)}(p_i) and h^{(j)}(q_i) are the hidden-layer representations of the image and the text. The loss function of the j-th layer in the stacked corresponding autoencoder is:
L^{(j)}(\theta) = L_I^{(j)}(\theta) + L_T^{(j)}(\theta) + L_{cor}^{(j)}(\theta)
where:
L_I^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - \hat{h}_I^{(j)}(p_i) \|^2
L_T^{(j)}(\theta) = \sum_i \| h_T^{(j)}(q_i) - \hat{h}_T^{(j)}(q_i) \|^2
L_{cor}^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - h_T^{(j)}(q_i) \|^2
L_I^{(j)} and L_T^{(j)} are the reconstruction errors of the image and text autoencoders, L_{cor}^{(j)} is the association constraint error between the image and the text, h_I^{(j)}(p_i) is the representation in the j-th hidden layer of the image in the stacked autoencoder, \hat{h}_I^{(j)}(p_i) is the representation in the j-th reconstruction layer of the image, h_T^{(j)}(q_i) is the representation in the j-th hidden layer of the text, \hat{h}_T^{(j)}(q_i) is the representation in the j-th reconstruction layer of the text, and \theta denotes all parameters of the j-th layer of the stacked autoencoder.
The objective function for the overall adjustment of the stacked corresponding autoencoder is:
J = \sum_i ( \| x_0^{(i)} - x_{2h}^{(i)} \|^2 + \| y_0^{(i)} - y_{2h}^{(i)} \|^2 ) + \delta(q)
where x_0 and y_0 are the input feature vectors of the image and the text, x_{2h} and y_{2h} are their corresponding reconstructed feature representations, and \delta(q) is the L_2 regularization term over all parameters of the stacked corresponding autoencoder.
The center points of the positive feedback and negative feedback clusters in the user's history records are calculated with the K-means method, as follows:
acquiring the user's history records, which comprise positive feedback and negative feedback records;
extracting the feature representations of the positive feedback and negative feedback data using the MMDNN model;
calculating the Euclidean distances between the feature representations within the positive feedback data and within the negative feedback data respectively;
and obtaining the center points of the positive feedback and negative feedback clusters in the user's records by applying the K-means method to each set respectively.
When the positive feedback score and the negative feedback score of each of the M items are calculated using the cluster center points of the positive and negative feedback, the distances between the feature of the candidate picture or text data and the user's positive feedback center and negative feedback center are computed, and the sum of the reciprocals of the distances from the image or text feature to the positive and negative feedback cluster centers is used as the positive-and-negative feedback score of the data.
When the total score of each of the M items is calculated, the similarity score, the positive feedback score and the negative feedback score are combined in a weighted manner to give the total score of the image or text data; the weighting formula is:
S_i = \alpha S_i^{sim} + (1 - \alpha) S_i^{pn}
where S_i is the total score of the image or text, S_i^{sim} is its similarity score, S_i^{pn} is its positive-and-negative feedback score, \alpha is the weight of the similarity score, and i denotes the i-th candidate picture.
A computer device comprises a processor and a memory, wherein the memory stores a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when doing so can implement the multi-modal image-text recommendation method of the invention.
A computer-readable storage medium stores a computer program which, when executed by a processor, can implement the multi-modal image-text recommendation method of the invention.
Compared with the prior art, the invention has at least the following beneficial effects:
Aiming at users' need for combined multi-modal information, the invention provides a cross-modal recommendation method that can recommend combined image and text information resources of likely interest to the user according to the user's preferences; it is the first recommendation method that recommends combined multi-modal information to the user. To realize this function, the invention adopts a cross-modal retrieval model that, compared with traditional cross-modal retrieval models, trains faster and retrieves more accurately. The invention also considers the user's positive and negative feedback on the resources recommended by the system, combines this feedback information with the cross-modal retrieval model, and recommends combined multi-modal information resources according to the user's interests, thereby realizing multi-modal resource recommendation across the different forms of pictures and texts. Furthermore, the invention updates the positive and negative feedback cluster centers in a timely manner according to the user's real-time feedback, so that the recommended content changes as the user's interests change. The invention improves the quality and efficiency of cross-modal retrieval, applies cross-modal retrieval to the field of recommendation systems, does not depend on excessive historical data, and effectively improves the efficiency and accuracy of recommendation.
Furthermore, the MobileNetV3-large model, which offers high accuracy and high speed in image feature extraction, is used to extract image features; the TF-IDF algorithm, which is widely used in text classification and can weight different words in a text, is used for the preliminary text features; and the DAE can efficiently extract the linear and non-linear relationships in the data for the representation within the text modality.
Drawings
FIG. 1 is a schematic diagram of a recommendation method that may be implemented according to the present invention.
Fig. 2 is a schematic diagram of a novel cross-modal search method.
Detailed Description
The technical scheme of the invention is further explained by combining the drawings and the examples.
A cross-modal image-text retrieval model MMDNN (Multimodal Deep Neural Network) is used within a recommendation system. A positive and negative feedback cluster center calculation module PNFCCCM (Positive and Negative Feedback Cluster Center Calculation Module) and the user's positive and negative feedback history are used to compute the user's positive and negative feedback cluster centers. Combining the similarity scores and the positive and negative feedback scores of the data, the items with the highest comprehensive scores relative to the user's history are found in the database, and the MMDNN model then finds the corresponding data of the other modality in the database. Finally, the paired image-text resources are recommended to the user, and the user's history and positive and negative feedback cluster centers are updated according to the user's feedback, realizing multi-modal image-text recommendation.
Referring to fig. 1, a deep learning based multi-modal image-text recommendation method comprises the following steps:
referring to fig. 2, step 1, the cross modal search model MMDNN is trained using the Wikipedia dataset.
Step 1.1, extracting the characteristics of the pictures by using the MobileNet V3, and for the image characteristic extraction, preprocessing the images, converting black and white pictures into three-channel color pictures, and unifying the left and right pictures into a size of 224 × 224 to obtain an image set.
And dividing the image set into a training set and a testing set, sending the training set and the testing set into a MobileNet 3-large model to execute a classification task, and stopping training when the accuracy of the testing set reaches the highest.
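As an illustration of this step, the following minimal PyTorch/torchvision sketch (torchvision ≥ 0.13 assumed) performs the preprocessing and extracts the 1280-dimensional feature by dropping the final classification layer of MobileNetV3-large; the normalization constants and the helper function name are choices made for this example, not specified by the patent.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Preprocessing as described: force 3-channel RGB and resize to 224 x 224.
preprocess = transforms.Compose([
    transforms.Lambda(lambda im: im.convert("RGB")),   # black-and-white -> 3 channels
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# MobileNetV3-large with the last classification layer removed:
# the classifier's final Linear(1280 -> 1000) is dropped, leaving a 1280-dim output.
backbone = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.DEFAULT)
backbone.classifier = nn.Sequential(*list(backbone.classifier.children())[:-1])
backbone.eval()

def extract_image_feature(path: str) -> torch.Tensor:
    """Return the 1280-dimensional preliminary image representation I_m."""
    x = preprocess(Image.open(path)).unsqueeze(0)      # shape (1, 3, 224, 224)
    with torch.no_grad():
        return backbone(x).squeeze(0)                  # shape (1280,)
```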
The MobileNetV3-large model with the last classification layer removed extracts the picture features, giving a 1280-dimensional preliminary image feature representation I_m. Because the top layer of the Multimodal DBN module is 1024-dimensional, the 1280-dimensional preliminary image feature I_m is reduced through one AE (AutoEncoder) layer so that the output is 1024-dimensional. Its functions are:
h = f(x) = l_f(w_f x + b_h)
r = g(h) = l_g(w_g h + b_r)
L(x, r) = ||r - x||^2
where h denotes the hidden layer, r the reconstruction layer, f(x) and g(h) the activation functions (sigmoid activation functions are used in the invention), w_f and w_g the weights, b_h and b_r the biases, and L(x, r) the reconstruction error function. The AE is trained by minimizing this error function, and its final 1024-dimensional output is the image representation I_a.
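A minimal PyTorch sketch of such a single-layer AE, assuming the 1280-dimensional input and 1024-dimensional hidden layer described above (the optimizer, learning rate, batch of random features and training-loop length are illustrative assumptions), could look like this:

```python
import torch
import torch.nn as nn

class SimpleAE(nn.Module):
    """One-layer autoencoder: h = sigmoid(W_f x + b_h), r = sigmoid(W_g h + b_r)."""
    def __init__(self, in_dim: int = 1280, hid_dim: int = 1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(hid_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.encoder(x)
        return h, self.decoder(h)

ae = SimpleAE()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
batch = torch.rand(32, 1280)                      # stands in for a batch of I_m features

for _ in range(10):                               # illustrative training loop
    h, r = ae(batch)
    loss = ((r - batch) ** 2).sum(dim=1).mean()   # reconstruction error L(x, r) = ||r - x||^2
    opt.zero_grad()
    loss.backward()
    opt.step()

I_a = ae.encoder(batch)                           # 1024-dimensional representation I_a
```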
Step 1.2, obtaining the preliminary text representation with the TF-IDF algorithm, specifically as follows:
first, stop words in the text are removed with the NLTK tool; the TF-IDF value of each word in each text is then computed with the TF-IDF algorithm, the words in each document are sorted in descending order of TF-IDF value, and the TF-IDF values of the first 3000 words are selected as the preliminary representation of the text. The words of all documents are counted and each document is encoded according to a unified word order: the 3000 words of each document are represented by their TF-IDF values at the positions corresponding to the total vocabulary, and the remaining positions are filled with 0.
The dimensionality of the preliminary text representation extracted by TF-IDF is still too large, so it is reduced to 3000 dimensions with the PCA algorithm; the reduced text representation is denoted T_p.
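A sketch of this text preprocessing with NLTK and scikit-learn could look as follows; the two-document corpus is a placeholder, and reducing to 3000 components assumes the corpus has at least that many documents and vocabulary terms:

```python
# Requires: pip install nltk scikit-learn, plus nltk.download("stopwords") once.
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

corpus = [
    "first placeholder document about mountains and rivers",
    "second placeholder document about city photographs",
]

# TF-IDF over the whole vocabulary, with English stop words (NLTK list) removed.
vectorizer = TfidfVectorizer(stop_words=stopwords.words("english"))
tfidf = vectorizer.fit_transform(corpus).toarray()        # one row per document

# Reduce the vocabulary-sized vectors with PCA to obtain the preliminary
# text representation T_p (3000 dimensions in the patent's setting).
n_components = min(3000, tfidf.shape[0], tfidf.shape[1])  # PCA cannot exceed either dimension
T_p = PCA(n_components=n_components).fit_transform(tfidf)
```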
Step 1.3, extracting the representation within the text modality with a DAE, specifically:
the input of the DAE is the preliminary representation T_p after PCA dimension reduction; its hidden layers are set to 2 layers with 2048 and 1024 dimensions respectively.
As an example, the DAE is trained by minimizing the following objective function:
J_{DAE} = L_r(x, x_{2h}) + \lambda \sum_{p=1}^{h} ( \| w_e^{(p)} \|^2 + \| w_d^{(p)} \|^2 )
where L_r(x, x_{2h}) is the reconstruction error, w_e and w_d are the weights of the encoder and the decoder, p indexes the p-th hidden layer, h is the number of hidden layers, and the second term is the L_2 regularization. The final output is the representation T_d within the text modality.
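As a sketch, a denoising autoencoder matching the 2048/1024 hidden-layer layout, with the L2 penalty applied through weight decay, might be written as below; the noise level, weight-decay value and training loop are illustrative assumptions, not values given by the patent:

```python
import torch
import torch.nn as nn

class TextDAE(nn.Module):
    """Denoising autoencoder with hidden layers of 2048 and 1024 units."""
    def __init__(self, in_dim: int = 3000):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 2048), nn.Sigmoid(),
            nn.Linear(2048, 1024), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(1024, 2048), nn.Sigmoid(),
            nn.Linear(2048, in_dim),
        )

    def forward(self, x):
        noisy = x + 0.1 * torch.randn_like(x)        # corrupt the input (denoising)
        T_d = self.encoder(noisy)
        return T_d, self.decoder(T_d)

dae = TextDAE()
# weight_decay adds the L2 penalty on encoder/decoder weights from the objective.
opt = torch.optim.Adam(dae.parameters(), lr=1e-3, weight_decay=1e-4)
T_p = torch.rand(32, 3000)                           # stands in for preliminary text features

for _ in range(10):                                  # illustrative training loop
    T_d, recon = dae(T_p)
    loss = ((recon - T_p) ** 2).sum(dim=1).mean()    # reconstruction error L_r(x, x_2h)
    opt.zero_grad()
    loss.backward()
    opt.step()
```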
Step 1.4, extracting the inter-modality representations of the image and the text with a Multimodal DBN (Deep Belief Network). The Multimodal DBN model has two inputs, an image representation and a text representation. The image input is the 1280-dimensional preliminary image feature I_m described in step 1.1, passed through one RBM layer to reduce its dimension to 1024; the activation function used in this dimension reduction is the sigmoid function.
The text input is the reduced text representation T_p, which is fed into an RSRBM model. RSRBMs are often used to process discrete-valued data, and their energy function is:
E(v, h) = -\sum_i \sum_j v_i w_{ij} h_j - \sum_i b_i v_i - m \sum_j a_j h_j
where v_i is the value of the i-th node of the input layer, h_j is the value of the j-th node of the hidden layer, w_{ij} is the weight between the input layer and the hidden layer, b_i is the bias of the i-th node of the input layer, a_j is the bias of the j-th node of the hidden layer, and m is the sum of the discrete values of the visible layer.
The output of the RSRBM model is used as the text input of the Multimodal DBN and is processed through two hidden layers with 2048 and 1024 nodes respectively, using the sigmoid activation function. Then, at the joint layer of the Multimodal DBN, alternating Gibbs sampling is performed using the following formulas:
P(h_t \mid h_i^{(1)}) = \sigma(W_i^{(2)} h_i^{(1)} + a_t)
P(h_i \mid h_t^{(2)}) = \sigma(W_t^{(2)} h_t^{(2)} + a_i)
\sigma(x) = 1/(1 + e^{-x})
The first two conditionals generate a distribution over the data of each modality; h_i^{(1)} is the layer-1 hidden layer of the image input, \sigma(\cdot) is the sigmoid activation function, W_i^{(2)} is the weight on layer 2 of the image, a_t is the bias of the last text layer, h_t^{(2)} is the layer-2 hidden layer of the text input, W_t^{(2)} is the weight on the layer-2 hidden layer of the text, a_i is the bias of the last image layer, x is the input to the activation function, and e is the natural constant. The Multimodal DBN produces two final outputs: an inter-modality representation of the image with text characteristics, denoted Y_i, and an inter-modality representation of the text with image characteristics, denoted Y_t.
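A small numpy sketch of the alternating Gibbs sampling between the two pathways at the joint layer, using sigmoid conditionals of the form given above, is shown below; all weights, biases and inputs are random stand-ins (a real model would use RBM-trained parameters), and taking the final mean activations as Y_i and Y_t is an assumption of this example:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

d_img, d_txt = 1024, 1024                   # top-layer sizes of the image and text pathways
W_i = rng.normal(scale=0.01, size=(d_img, d_txt))   # image-side weights W_i^(2)
W_t = rng.normal(scale=0.01, size=(d_txt, d_img))   # text-side weights W_t^(2)
a_t = np.zeros(d_txt)                       # bias of the last text layer
a_i = np.zeros(d_img)                       # bias of the last image layer

h_img = rng.random(d_img)                   # top hidden activity of the image pathway
h_txt = rng.random(d_txt)                   # top hidden activity of the text pathway

for _ in range(50):                         # alternating Gibbs sampling steps
    p_txt = sigmoid(h_img @ W_i + a_t)      # P(h_t | h_i): text given image
    h_txt = (rng.random(d_txt) < p_txt).astype(float)
    p_img = sigmoid(h_txt @ W_t + a_i)      # P(h_i | h_t): image given text
    h_img = (rng.random(d_img) < p_img).astype(float)

Y_i, Y_t = p_img, p_txt                     # mean activations kept as inter-modality representations
```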
Step 1.5, the intra-modality representation and the inter-modality representation of each modality are fused by two joint-RBMs.
The inputs of the image joint-RBM are I_a and Y_i and its output is I_0; the inputs of the text joint-RBM are T_d and Y_t and its output is T_0. The image joint-RBM model fuses the intra-modality and inter-modality representations of the image to obtain the comprehensive image representation I_0; the text joint-RBM model fuses the intra-modality and inter-modality representations of the text to obtain the comprehensive text representation T_0.
Step 1.6, two DAEs are used to perform classification training on I_0 and T_0 respectively. As an embodiment, the invention determines the optimal number of hidden layers to be 3, with 512, 256 and 64 nodes respectively.
And step 1.7, corresponding the hidden layers of the image and the hidden layers of the text one by one to form a new self-encoder, namely a stack type corresponding self-encoder.
Step 1.8, applying an association constraint function in the stack-type corresponding self-encoder, training the stack-type corresponding self-encoder layer by layer from bottom to top by minimizing an objective function, and then adjusting all self-encoders on the whole, so that the stack-type corresponding self-encoder can establish a relation between the representations of the images and the texts while obtaining the final representations of the images and the texts.
Taking the association constraint function of the j-th layer as an example:
L_{cor}^{(j)}(p_i, q_i) = \| h^{(j)}(p_i; W_p) - h^{(j)}(q_i; W_q) \|^2
where p_i and q_i are the image and text inputs, W_p and W_q are the parameters of the image and text networks, and h^{(j)}(p_i) and h^{(j)}(q_i) are the hidden-layer representations of the image and the text. The loss function of the j-th layer of the SCAE can then be expressed as:
L^{(j)}(\theta) = L_I^{(j)}(\theta) + L_T^{(j)}(\theta) + L_{cor}^{(j)}(\theta)
with:
L_I^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - \hat{h}_I^{(j)}(p_i) \|^2
L_T^{(j)}(\theta) = \sum_i \| h_T^{(j)}(q_i) - \hat{h}_T^{(j)}(q_i) \|^2
where L_I^{(j)} and L_T^{(j)} are the reconstruction errors of the image and text autoencoders, computed from the j-th hidden-layer representations and their reconstruction-layer counterparts, and L_{cor}^{(j)} is the association constraint error between the image and the text.
The objective function at the overall adjustment stage is:
J = \sum_i ( \| x_0^{(i)} - x_{2h}^{(i)} \|^2 + \| y_0^{(i)} - y_{2h}^{(i)} \|^2 ) + \delta(q)
where x_0 and y_0 are the input feature vectors of the image and the text, x_{2h} and y_{2h} are their corresponding reconstructed feature representations, and \delta(q) is the L_2 regularization term over all parameters of the stacked corresponding autoencoder.
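A PyTorch sketch of one layer of such a correspondence autoencoder, combining the two reconstruction errors with the association constraint, is given below; the layer dimensions, the weight lam on the constraint and the random training batch are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CorrespondenceAELayer(nn.Module):
    """One SCAE layer: parallel image/text autoencoders tied by a correlation constraint."""
    def __init__(self, img_dim=1024, txt_dim=1024, hid_dim=512):
        super().__init__()
        self.enc_img = nn.Sequential(nn.Linear(img_dim, hid_dim), nn.Sigmoid())
        self.dec_img = nn.Linear(hid_dim, img_dim)
        self.enc_txt = nn.Sequential(nn.Linear(txt_dim, hid_dim), nn.Sigmoid())
        self.dec_txt = nn.Linear(hid_dim, txt_dim)

    def forward(self, p, q):
        h_p, h_q = self.enc_img(p), self.enc_txt(q)
        return h_p, h_q, self.dec_img(h_p), self.dec_txt(h_q)

def scae_layer_loss(p, q, h_p, h_q, r_p, r_q, lam=1.0):
    rec_img = ((r_p - p) ** 2).sum(dim=1).mean()     # image reconstruction error
    rec_txt = ((r_q - q) ** 2).sum(dim=1).mean()     # text reconstruction error
    corr = ((h_p - h_q) ** 2).sum(dim=1).mean()      # association constraint on hidden codes
    return rec_img + rec_txt + lam * corr

layer = CorrespondenceAELayer()
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)
I_0, T_0 = torch.rand(32, 1024), torch.rand(32, 1024)   # stand-ins for the fused representations

for _ in range(10):                                  # illustrative layer-wise training
    h_p, h_q, r_p, r_q = layer(I_0, T_0)
    loss = scae_layer_loss(I_0, T_0, h_p, h_q, r_p, r_q)
    opt.zero_grad()
    loss.backward()
    opt.step()
```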
Step 2: based on the user's history records, the trained cross-modal retrieval model MMDNN is used to calculate the center points of the user's positive and negative feedback clusters.
Step 2.1, the user's history records are first obtained, comprising the user's name, browsing records and scores; the full score is 5, a score of 3 or above is regarded as positive feedback, and a score below 3 is regarded as negative feedback. Users at different stages emphasize image or text resources differently: if the user focuses more on pictures, image-text resources are recommended according to the user's picture records; if the user focuses more on text resources, image-text resources are recommended according to the text records browsed by the user.
As an example, the following is a process of recommending the image-text resource to the user according to the picture record of the user, and the process of recommending the image-text resource to the user according to the text record of the user is similar to this.
Step 2.2, the user's interests change over time, so the positive and negative feedback cluster center points are calculated from the user's most recent 50 picture records; if there are fewer than 50, all of them are used.
The 64-dimensional final representations of the user's 50 pictures are obtained with MMDNN, and the distance between any two pictures is computed with the Euclidean distance:
d(I_{f1}, I_{f2}) = \sqrt{ \sum_{k=1}^{64} (I_{f1,k} - I_{f2,k})^2 }
where I_{f1} and I_{f2} are the final representations of the two pictures.
Then the positive and negative feedback center points of the user's records are calculated separately with the K-means algorithm, where K is set to 1.
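With K = 1, the K-means center of each group reduces to the mean of its feature vectors; the following scikit-learn sketch illustrates this, with random arrays standing in for the 64-dimensional MMDNN representations:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pos_features = rng.random((30, 64))    # final 64-dim representations of positive-feedback pictures
neg_features = rng.random((20, 64))    # final 64-dim representations of negative-feedback pictures

# K-means with k = 1 on each group; with a single cluster the center equals the mean.
pos_center = KMeans(n_clusters=1, n_init=10).fit(pos_features).cluster_centers_[0]
neg_center = KMeans(n_clusters=1, n_init=10).fit(neg_features).cluster_centers_[0]

assert np.allclose(pos_center, pos_features.mean(axis=0))
```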
Step 3, as an example, the first 20 records are selected from the user's positive feedback records; if there are fewer than 20, all of them are taken. Recommendations to the user are made on this basis.
Step 4, the final representations of the 20 pictures (I_{f1}, I_{f2}, I_{f3}, ..., I_{f20}) are extracted with the MMDNN model, and the category to which these 20 final representations belong is determined. For convenience of description, it is assumed here that the 20 final representations belong to the category "landscape".
Step 5, all pictures in the "landscape" sub-database of the picture database (excluding the pictures already browsed by the user) are processed by MMDNN to obtain their final representations (I_{r1}, I_{r2}, I_{r3}, I_{r4}, I_{r5}, ...).
Step 6, the similarity score between the final representation of each picture obtained in step 5 and the user's 20 records is calculated. As an example implementation, cosine similarity is used, and the similarity score is:
S_{ri}^{sim} = \sum_{j=1}^{20} sim(I_{ri}, I_{fj})
where S_{ri}^{sim} is the similarity score of picture I_{ri} and sim(I_{ri}, I_{fj}) is the similarity between pictures I_{ri} and I_{fj}, computed as:
sim(A, B) = \frac{ \sum_{k=1}^{n} A_k B_k }{ \sqrt{ \sum_{k=1}^{n} A_k^2 } \sqrt{ \sum_{k=1}^{n} B_k^2 } }
where A and B are two n-dimensional vectors, A_k and B_k are the k-th features of vectors A and B, A is picture I_{ri} and B is picture I_{fj}. After the similarity scores are calculated, they are sorted in descending order and the top M pictures are taken as candidates. As an example, M is set to 10.
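A numpy sketch of this scoring step, summing the cosine similarity of each candidate against the 20 user records and keeping the top M = 10, could look as follows; the random vectors stand in for the MMDNN final representations:

```python
import numpy as np

rng = np.random.default_rng(1)
user_records = rng.random((20, 64))      # final representations I_f1 ... I_f20
candidates = rng.random((500, 64))       # final representations of the "landscape" pictures

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity score of each candidate = combined cosine similarity to the 20 records.
sim_scores = np.array([
    sum(cosine(c, r) for r in user_records) for c in candidates
])

M = 10
top_m = np.argsort(sim_scores)[::-1][:M]  # indices of the 10 candidate pictures kept
```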
Step 7, the distance between each of the 10 candidate pictures and the user's positive and negative feedback cluster centers is calculated, and the sum of the reciprocals of the distances from each picture to the positive feedback center and the negative feedback center is used as the positive-and-negative feedback score of that picture:
S_i^{pn} = \frac{1}{d(I_{ri}, X)} + \frac{1}{d(I_{ri}, Y)}
where I_{ri} denotes the i-th candidate picture, X is the positive feedback cluster center point, Y is the negative feedback cluster center point, and S_i^{pn} is the positive-and-negative feedback score of the i-th candidate picture.
Step 8, the total score of each candidate picture is calculated by combining its similarity score and its positive-and-negative feedback score; the total scores are sorted in descending order and the first K pictures are taken as the final candidate recommended pictures. As an example, K is set to 5, and the similarity score and the positive-and-negative feedback score are combined in a weighted manner:
S_i = \alpha S_i^{sim} + (1 - \alpha) S_i^{pn}
where S_i is the final score of the i-th picture, \alpha is the weight of the picture similarity score, and i denotes the i-th candidate picture.
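Continuing the sketch above, the positive-and-negative feedback score and the weighted total score of each of the 10 candidates could be computed as below; alpha = 0.5, the epsilon guard against division by zero, and the random stand-in data are assumptions of this example, and the (1 - alpha) weighting follows the formula as reconstructed above:

```python
import numpy as np

def feedback_score(feature, pos_center, neg_center, eps=1e-8):
    """Sum of reciprocal Euclidean distances to the positive and negative centers."""
    d_pos = np.linalg.norm(feature - pos_center)
    d_neg = np.linalg.norm(feature - neg_center)
    return 1.0 / (d_pos + eps) + 1.0 / (d_neg + eps)

def total_score(sim_score, pn_score, alpha=0.5):
    """Weighted combination S_i = alpha * similarity + (1 - alpha) * feedback score."""
    return alpha * sim_score + (1 - alpha) * pn_score

rng = np.random.default_rng(2)
pos_center, neg_center = rng.random(64), rng.random(64)
candidate_features = rng.random((10, 64))          # the M = 10 candidates kept earlier
sim_scores = rng.random(10)                        # their similarity scores

totals = np.array([
    total_score(s, feedback_score(f, pos_center, neg_center))
    for f, s in zip(candidate_features, sim_scores)
])
K = 5
top_k = np.argsort(totals)[::-1][:K]               # final candidate recommended pictures
```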
Step 9, for each picture in the obtained alternative recommended pictures, finding a text resource corresponding to the picture from a text database by using an MMDNN model;
and step 10, combining the obtained alternative recommended pictures and the corresponding texts thereof to form 5 pairs of picture-text resources, namely forming a result recommended to the user.
The invention also updates the history record and the positive and negative feedback central point of the user according to the feedback of the user.
As an alternative embodiment, the invention further provides a computer device comprising a processor and a memory, wherein the memory stores a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when doing so can implement part or all of the steps of the multi-modal image-text recommendation method of the invention; the memory is further used to store the user's history records.
A computer-readable storage medium stores a computer program which, when executed by a processor, can implement the multi-modal image-text recommendation method of the invention.
The computer equipment can adopt a notebook computer, a tablet computer, a desktop computer, a mobile phone or a workstation.
The invention also provides an output device for outputting the prediction result, wherein the output device is linked with the output end of the processor, and the output device is a display or a printer.
The processor of the present invention may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC) or a ready-made programmable gate array (FPGA).
The memory of the present invention may be an internal storage unit of a notebook computer, tablet computer, desktop computer, mobile phone or workstation, such as a memory or a hard disk; external storage units such as removable hard disks or flash memory cards may also be used.
Computer-readable storage media may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. The computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), solid-state drive (SSD) or optical disc. The random access memory may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM).

Claims (8)

1. A multi-modal image-text recommendation method based on deep learning, characterized by comprising the following steps:
calculating a clustering center point of positive feedback and negative feedback of a user by adopting a cross-modal retrieval model based on a historical record of the user, wherein the historical record comprises an image and a text;
selecting the first N historical records with higher user scores from the user historical records;
extracting the characteristics of the N historical records, and obtaining the categories of the N historical records according to the characteristics;
extracting the data of the same type from the database with the same history recording modes by using a cross-mode retrieval model;
calculating similarity scores of the extracted data of the same type and the N historical records, arranging the similarity scores according to a reverse order, and selecting the historical records corresponding to the previous M similarity scores;
respectively calculating the positive feedback score and the negative feedback score of each history record in the M items by using the clustering center points of the positive feedback and the negative feedback;
calculating the total score of each data in the M historical records according to the similarity score of each of the M similarity scores, the positive feedback score and the negative feedback score, arranging the total scores in a reverse order, and selecting the first K data;
for each item of data in the K data, finding out K data corresponding to the item of data from a text database or an image database by using a cross-modal retrieval model;
correspondingly combining the first K data with K data in a text database or an image database to form K image-text pairs, namely obtaining a recommendation result; the cross-modal retrieval model is obtained by training through the following processes:
the method comprises the steps of firstly extracting image features by adopting a MobileNetV3-large model with the last classification layer removed; on the basis of the preliminary image feature extraction, on the one hand an AE (autoencoder) is used to extract the representation within the image modality, namely the intra-modality representation of the image; on the other hand an RBM is used to obtain a further representation of the image, which is to be used to form the inter-modality representation of the image with text information;
preliminarily extracting text features by using a TF-IDF algorithm; on the one hand, on the basis of the preliminary extraction of text features, DAE is used for extracting representations in a text mode, namely the representations in the text mode with intra-mode information; using RSRBM extraction on the one hand to derive a text further representation to be used to form a text inter-modality representation with image information;
based on the further representation of the image and the further representation of the text, the invention extracts an inter-modal representation of the image and the text with a Multimodal DBN; performing Gibbs sampling alternately between the image and the text representation at the top layer of the Multimodal DBN, namely obtaining an inter-image modality representation with text characteristics and an inter-text modality representation with image characteristics;
the intra-modality representation and the inter-modality representation of each modality are fused using two joint-RBM models,
one joint-RBM model fuses the intra-modality and inter-modality representations of the image to obtain the comprehensive representation of the image; the other joint-RBM model fuses the intra-modality and inter-modality representations of the text to obtain the comprehensive representation of the text;
respectively carrying out classification training on the comprehensive representation of the image and the comprehensive representation of the text by using two DAEs so as to extract the optimal hidden layer number of the image and text characteristics;
fixing the optimal hidden layers of the extracted images and texts, and aligning the optimal hidden layers of the images and the texts one by one to form a stack type corresponding self-encoder;
in the stack type corresponding self-encoder, an association constraint function is used, the comprehensive representation of the second-stage image and the comprehensive representation of the second-stage text are reused to train the stack type corresponding self-encoder, so that the stack type corresponding self-encoder can establish a relation between the representations of the image and the text while obtaining the final representation of the image and the text;
when the clustering center points of the positive feedback and the negative feedback are used for respectively calculating the positive feedback score and the negative feedback score of each historical record in the M items: and calculating the distance between the alternative picture or text data characteristic and the positive feedback center and the negative feedback center of the user, and using the sum of the reciprocal of the distance between the image or text data characteristic and the positive and negative feedback cluster center point as the positive and negative feedback score of the data.
2. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein the cross-modal retrieval model is used for data feature extraction, and the training of the cross-modal retrieval model is divided into two stages:
in a first stage, for an image, extracting representations within an image modality and representations between image modalities with text information; for text, extracting representations in a text mode and representations between the text modes with image information;
in a second phase, combining the representation within the image modality and the representation between the image modalities to form a composite representation of the image; meanwhile, the representation in the text mode and the representation between the text modes are combined to form a text comprehensive representation, then a stack type corresponding self-encoder and a constraint function are utilized to establish the relation between the comprehensive representation of the image and the text, and the final representation of the image and the text is learned.
3. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein when the Multimodal DBN is used to extract the inter-modality representations of the image and the text, the preliminary representation of the text is first input into an RSRBM model, whose energy function is:
E(v, h) = -\sum_i \sum_j v_i w_{ij} h_j - \sum_i b_i v_i - m \sum_j a_j h_j
where v_i is the value of the i-th node of the input layer, h_j is the value of the j-th node of the hidden layer, w_{ij} is the weight between the input layer and the hidden layer, b_i is the bias of the i-th node of the input layer, a_j is the bias of the j-th node of the hidden layer, and m is the sum of the discrete values of the visible layer;
the output of the RSRBM model is used as the text input of the Multimodal DBN and is processed through two hidden layers with 2048 and 1024 nodes respectively, using the sigmoid activation function; then, at the joint layer of the Multimodal DBN, alternating Gibbs sampling is performed using the following formulas, obtaining representations carrying inter-modality information:
P(h_t \mid h_i^{(1)}) = \sigma(W_i^{(2)} h_i^{(1)} + a_t)
P(h_i \mid h_t^{(2)}) = \sigma(W_t^{(2)} h_t^{(2)} + a_i)
\sigma(x) = 1/(1 + e^{-x})
where the first two conditionals generate a distribution over the data of each modality, h_i^{(1)} is the layer-1 hidden layer of the image input, \sigma(\cdot) is the sigmoid activation function, W_i^{(2)} is the weight on layer 2 of the image, a_t is the bias of the last text layer, h_t^{(2)} is the layer-2 hidden layer of the text input, W_t^{(2)} is the weight on the layer-2 hidden layer of the text, a_i is the bias of the last image layer, x is the input to the activation function, and e is the natural constant.
4. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein the association constraint function is:
L_{cor}^{(j)}(p_i, q_i) = \| h^{(j)}(p_i; W_p) - h^{(j)}(q_i; W_q) \|^2
where p_i and q_i are the image and text inputs, W_p and W_q are the parameters of the image and text networks, and h^{(j)}(p_i) and h^{(j)}(q_i) are the hidden-layer representations of the image and the text; the loss function of the j-th layer in the stacked corresponding autoencoder is:
L^{(j)}(\theta) = L_I^{(j)}(\theta) + L_T^{(j)}(\theta) + L_{cor}^{(j)}(\theta)
wherein:
L_I^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - \hat{h}_I^{(j)}(p_i) \|^2
L_T^{(j)}(\theta) = \sum_i \| h_T^{(j)}(q_i) - \hat{h}_T^{(j)}(q_i) \|^2
L_{cor}^{(j)}(\theta) = \sum_i \| h_I^{(j)}(p_i) - h_T^{(j)}(q_i) \|^2
L_I^{(j)} and L_T^{(j)} are the reconstruction errors of the image and text autoencoders, L_{cor}^{(j)} is the association constraint error between the image and the text, h_I^{(j)}(p_i) is the representation in the j-th hidden layer of the image in the stacked autoencoder, \hat{h}_I^{(j)}(p_i) is the representation in the j-th reconstruction layer of the image, h_T^{(j)}(q_i) is the representation in the j-th hidden layer of the text, \hat{h}_T^{(j)}(q_i) is the representation in the j-th reconstruction layer of the text, and \theta denotes all parameters of the j-th layer of the stacked autoencoder;
the objective function for the overall adjustment of the stacked corresponding autoencoder is:
J = \sum_i ( \| x_0^{(i)} - x_{2h}^{(i)} \|^2 + \| y_0^{(i)} - y_{2h}^{(i)} \|^2 ) + \delta(q)
where x_0 and y_0 are the input feature vectors of the image and the text, x_{2h} and y_{2h} are their corresponding reconstructed feature representations, and \delta(q) is the L_2 regularization term over all parameters of the stacked corresponding autoencoder.
5. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein the center points of the positive feedback and negative feedback clusters in the user's history records are calculated with the K-means method, as follows:
acquiring a history record of a user, wherein the history record comprises a positive feedback record and a negative feedback record;
extracting feature representation of the positive feedback and negative feedback data by using a cross-modal graph-text retrieval model;
respectively calculating the distance between the characteristic representations of the positive feedback data and the negative feedback data by using Euclidean distance;
and respectively calculating by using a K-means method to obtain the central points of positive feedback and negative feedback clustering in the user record.
6. The deep learning-based multi-modal image-text recommendation method according to claim 1, wherein when the total score of each of the M items is calculated, the similarity score, the positive feedback score and the negative feedback score are combined in a weighted manner to give the total score of the image or text data; the weighting formula is:
S_i = \alpha S_i^{sim} + (1 - \alpha) S_i^{pn}
where S_i is the total score of the image or text, S_i^{sim} is its similarity score, S_i^{pn} is its positive-and-negative feedback score, \alpha is the weight of the similarity score, and i denotes the i-th candidate picture.
7. A computer device comprising a processor and a memory, wherein the memory is used for storing a computer-executable program, the processor reads part or all of the computer-executable program from the memory and executes it, and when executing part or all of the computer-executable program the processor can realize the multi-modal image-text recommendation method according to any one of claims 1-6.
8. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, is adapted to carry out the multi-modal image-text recommendation method according to any one of claims 1-6.
CN202110385246.5A 2021-04-09 2021-04-09 Multi-mode image-text recommendation method and device based on deep learning Active CN113094534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110385246.5A CN113094534B (en) 2021-04-09 2021-04-09 Multi-mode image-text recommendation method and device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110385246.5A CN113094534B (en) 2021-04-09 2021-04-09 Multi-mode image-text recommendation method and device based on deep learning

Publications (2)

Publication Number Publication Date
CN113094534A CN113094534A (en) 2021-07-09
CN113094534B true CN113094534B (en) 2022-09-02

Family

ID=76676034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110385246.5A Active CN113094534B (en) 2021-04-09 2021-04-09 Multi-mode image-text recommendation method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN113094534B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114462554B (en) * 2022-04-13 2022-07-05 华南理工大学 Potential depression assessment system based on multi-mode width learning
CN114612749B (en) * 2022-04-20 2023-04-07 北京百度网讯科技有限公司 Neural network model training method and device, electronic device and medium
CN115964560B (en) * 2022-12-07 2023-10-27 南京擎盾信息科技有限公司 Information recommendation method and equipment based on multi-mode pre-training model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462485A (en) * 2014-12-18 2015-03-25 北京邮电大学 Cross-modal retrieval method based on corresponding deep-layer belief network
CN108595636A (en) * 2018-04-25 2018-09-28 复旦大学 The image search method of cartographical sketching based on depth cross-module state correlation study

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10459995B2 (en) * 2016-12-22 2019-10-29 Shutterstock, Inc. Search engine for processing image search queries in multiple languages
KR102387305B1 (en) * 2017-11-17 2022-04-29 삼성전자주식회사 Method and device for learning multimodal data
CN108647350A (en) * 2018-05-16 2018-10-12 中国人民解放军陆军工程大学 A kind of picture and text associative search method based on binary channels network
CN108876643A (en) * 2018-05-24 2018-11-23 北京工业大学 It is a kind of social activity plan exhibition network on acquire(Pin)Multimodal presentation method
US11074253B2 (en) * 2018-11-02 2021-07-27 International Business Machines Corporation Method and system for supporting inductive reasoning queries over multi-modal data from relational databases
US20200311798A1 (en) * 2019-03-25 2020-10-01 Board Of Trustees Of The University Of Illinois Search engine use of neural network regressor for multi-modal item recommendations based on visual semantic embeddings
US11244205B2 (en) * 2019-03-29 2022-02-08 Microsoft Technology Licensing, Llc Generating multi modal image representation for an image
CN110457516A (en) * 2019-08-12 2019-11-15 桂林电子科技大学 A kind of cross-module state picture and text search method
CN112287166B (en) * 2020-09-23 2023-03-07 山东师范大学 Movie recommendation method and system based on improved deep belief network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462485A (en) * 2014-12-18 2015-03-25 北京邮电大学 Cross-modal retrieval method based on corresponding deep-layer belief network
CN108595636A (en) * 2018-04-25 2018-09-28 复旦大学 The image search method of cartographical sketching based on depth cross-module state correlation study

Also Published As

Publication number Publication date
CN113094534A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN113094534B (en) Multi-mode image-text recommendation method and device based on deep learning
Kaur et al. Comparative analysis on cross-modal information retrieval: A review
Dering et al. A convolutional neural network model for predicting a product's function, given its form
US20170200066A1 (en) Semantic Natural Language Vector Space
CN112241468A (en) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
GB2546360A (en) Image captioning with weak supervision
WO2023065617A1 (en) Cross-modal retrieval system and method based on pre-training model and recall and ranking
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
Li et al. MRMR-based ensemble pruning for facial expression recognition
CN111324765A (en) Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation
CN108154156B (en) Image set classification method and device based on neural topic model
CN112015868A (en) Question-answering method based on knowledge graph completion
Peng et al. UMass at ImageCLEF Medical Visual Question Answering (Med-VQA) 2018 Task.
CN116204706A (en) Multi-mode content retrieval method and system for text content and image analysis
US11755668B1 (en) Apparatus and method of performance matching
Abdul-Rashid et al. Shrec’18 track: 2d image-based 3d scene retrieval
CN113297410A (en) Image retrieval method and device, computer equipment and storage medium
CN111461175A (en) Label recommendation model construction method and device of self-attention and cooperative attention mechanism
CN115221369A (en) Visual question-answer implementation method and visual question-answer inspection model-based method
CN112950414B (en) Legal text representation method based on decoupling legal elements
CN112084338B (en) Automatic document classification method, system, computer equipment and storage medium
Bibi et al. Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval
CN116775798A (en) Cross-modal hash method based on feature fusion between graph network and modalities
US11810598B2 (en) Apparatus and method for automated video record generation
CN113297485B (en) Method for generating cross-modal representation vector and cross-modal recommendation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant