CN108876643A - A multimodal representation method for pins (Pin) on a social curation network - Google Patents

A multimodal representation method for pins (Pin) on a social curation network

Info

Publication number
CN108876643A
Authority
CN
China
Prior art keywords
layer
visible
input
hidden
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810505633.6A
Other languages
Chinese (zh)
Inventor
毋立芳
张岱
杨博文
简萌
刘海英
祁铭超
贾婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201810505633.6A priority Critical patent/CN108876643A/en
Publication of CN108876643A publication Critical patent/CN108876643A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

A multimodal representation method for pins (Pin) on a social curation network, relating to the technical fields of smart media computing and big data analysis. For a given pin, its picture is preprocessed (image scaling, image cropping) and fed into a convolutional neural network (CNN) trained on an automatically annotated image dataset; once the CNN's forward propagation completes, a mid-layer activation is extracted as the image representation. Each word in the pin's description is mapped to a word vector by a word2vec model trained on a large corpus, and all the word vectors are pooled into the text representation. The representations of the two modalities are fed together into a trained multimodal deep Boltzmann machine, and the inferred top-layer activation probabilities serve as the pin's multimodal joint representation. The invention fuses the data of the two different modalities, picture and text, into a unified representation space and handles the missing-value problem reasonably, providing a highly effective multimodal joint representation of pins.

Description

A multimodal representation method for pins (Pin) on a social curation network
Technical field
The present invention relates to the technical fields of smart media computing and big data analysis, and in particular to a multimodal representation method for pins (Pin) on a social curation network; more specifically, to a method for representing the pins in a social curation network using multimodal information such as pictures and text.
Background technique
With the rise of social networks (Facebook, Twitter, Weibo, etc.), the social web has accumulated ever more social behavior data, along with information about the relationships and interactions among its members. In recent years many social websites have added a "curation" function, where curation means planning, screening and displaying. A social curation network lets users collect, classify, share, like, comment on, rate and follow the items displayed on it (as shown in Figure 1), so that information is redistributed autonomously by the users and users freely express their own interests. Compared with traditional networks, social curation networks intensify the interaction between users, and users' means of expression become far richer and more varied. Unlike traditional social networks such as the information-sharing oriented Weibo and Twitter or the relationship-oriented Facebook, a social curation network is driven by users' points of interest: it is a social network built on users' interest in the items displayed on it. Research on interest-driven curation networks such as Pinterest and Huaban has therefore become a hot topic in recent years.
Unlike traditional social networks, a social curation network carries only a small amount of basic user information; the interaction between users and the items displayed on the site plays the leading role. The pin (Pin) is the most basic unit of content in a curation network: it consists of one picture and one piece of user-provided text describing that picture, i.e. information of two different modalities. A user can arrange and regroup the pins he is interested in and save them into different boards (Board), as shown in Figure 1. This means that a user's interests can be represented by all the boards he owns, and a board can be represented by all the pins (Pin) it contains; in other words, a representation of pins suffices to express the other kinds of nodes in a social curation network. Finding an effective multimodal representation of pins is therefore of great significance for research on user modeling, personalized recommendation and related fields in social curation networks.
Most recent multimodal research targets cross-modal retrieval or classification based on multimodal data. Such work yields retrieval or classification results but rarely a joint representation of the two modalities' data; it cannot form a unified representation space, so its expressive power is limited. Moreover these applications run on fixed databases possessing complete data of both modalities, with text and image data paired one to one. On the internet, however, data are often missing: 20-30% of pins lack text, which makes it difficult for existing multimodal methods to represent the pins in a social curation network. In addition, the repinning mechanism that curation networks naturally provide lets different users attach different category labels to the same image (the picture being the most important component of a pin) according to their own preferences, so it is hard to build the labeled datasets that supervised learning requires; this repinning phenomenon is one of the reasons conventional methods work poorly. For all these reasons, the expressive power of existing schemes for the pins in a social curation network is limited.
The present invention is based on the social curation website Huaban ("petal net"). It makes full use of the multimodal data on Huaban, handles the missing-value problem reasonably, and fuses the data of the two different modalities, picture and text, into a unified representation space, obtaining a highly effective multimodal joint representation method for the pins of a social curation network.
Summary of the invention
The object of the present invention is to provide a multimodal representation method for the pins on a social curation network (its framework is shown in Figure 1).
1. A pin (Pin) representation method based on multimodal data, characterized in that it comprises the following steps:
1) Construction of the multimodal joint representation learning framework for pins (Pin)
The deep-learning-based framework for learning a joint multimodal representation of pins is shown in Figure 2. The whole learning process divides into three parts: image representation, text representation and multimodal fusion. For a given pin, the picture is preprocessed and fed into a convolutional neural network (CNN) fine-tuned on an automatically annotated image dataset; after the CNN's forward propagation completes, the activation of a designated mid-layer is extracted as the image representation. The description is preprocessed, and each word in it is mapped to a word vector by a word2vec model trained on a large corpus; all the word vectors are pooled into the text representation. The representations of the two modalities are fed together into a trained multimodal deep Boltzmann machine (DBM, deep Boltzmann machines), and the inferred top-layer activation probabilities serve as the pin's multimodal joint representation.
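As a hedged illustration of this three-part data flow (all function and parameter names below are hypothetical, and the multimodal DBM is reduced to a single shared sigmoid layer purely for brevity), the pipeline could be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def image_representation(cnn_activation):
    # Stands in for the mid-layer activation extracted from the fine-tuned CNN.
    return np.asarray(cnn_activation, dtype=float)

def text_representation(word_vectors):
    # Mean-pool the word2vec vectors of the description's words.
    return np.mean(np.asarray(word_vectors, dtype=float), axis=0)

def joint_representation(img_vec, txt_vec, W_img, W_txt, b):
    # Stand-in for the multimodal DBM: activation probability of a shared
    # hidden layer driven by both modality representations.
    return sigmoid(img_vec @ W_img + txt_vec @ W_txt + b)
```

The actual method infers the top layer of a trained DBM rather than applying a single weight matrix per modality; this sketch only fixes the shapes of the three stages.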
2) Image representation
The picture is the core content of a pin and the most important carrier of user interest in a social curation network, so a good image representation is highly beneficial for user analysis there. The picture representation in a social curation network should not only contain the visual information of the picture itself, but should also bear some relationship to user interest in the curation network. Weighing efficiency against performance, the present invention chooses AlexNet trained on ImageNet as the base model for image representation learning.
The key to fine-tuning AlexNet is building a large-scale, high-quality dataset. The present invention represents the repinning relations of one and the same picture on the curation network as a tree structure, called the repin tree. For each picture, the frequency with which all users on its repin tree file it into the different categories is counted, and the vector of all category frequencies is used as a multidimensional real-valued label. Huaban has 33 categories in total; the frequency with which a picture is filed into the i-th category $C_i$ is denoted $f_i^C$, and the 33 category frequencies together form the picture's label.
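A minimal sketch of this label construction (the function name and the encoding of categories as integer indices 0..32 are our own choices):

```python
from collections import Counter

def frequency_label(repin_categories, num_classes=33):
    """Multidimensional real-valued label for a picture: the fraction of users
    on its repin tree who filed it under each of the num_classes categories."""
    counts = Counter(repin_categories)
    total = len(repin_categories)
    return [counts.get(c, 0) / total for c in range(num_classes)]
```

For example, a picture repinned by three users into categories 0, 0 and 1 would get a label beginning [2/3, 1/3, 0, ...].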
AlexNet was originally designed for the 1000-class classification of mutually exclusive objects, with a softmax log-loss layer. Since both the nature of the labels and the learning objective differ from the original model, the loss layer is replaced with a sigmoid (S-shaped function) cross-entropy (cross entropy) loss layer, whose loss function is

$L = -\sum_{n \in N_C} \sum_{i=1}^{33} \left[ f_i^C \log \hat{f}_i^C + \left(1 - f_i^C\right) \log\left(1 - \hat{f}_i^C\right) \right]$

where $N_C$ is the sample set and $\hat{f}_i^C$ is the sigmoid output corresponding to the category frequency $f_i^C$.
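The sigmoid cross-entropy loss against real-valued frequency labels can be computed numerically as follows (a minimal reimplementation for illustration, not the actual Caffe loss layer; the epsilon floor is our own numerical safeguard):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_cross_entropy(logits, freq_labels):
    """Sigmoid cross-entropy of per-class logits against frequency labels in [0, 1]."""
    p = sigmoid(np.asarray(logits, dtype=float))
    f = np.asarray(freq_labels, dtype=float)
    eps = 1e-12  # numerical floor so log never sees exactly 0
    return -np.sum(f * np.log(p + eps) + (1.0 - f) * np.log(1.0 - p + eps))
```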
3) Text representation
The description is an important complement to the picture in a pin; among the different pins on a repin tree, the description is one of the main contents distinguishing their users' preferences, so a good text representation matters greatly for user analysis, and especially personalization analysis, in a social curation network. Like the image representation, the text representation also implicitly encodes a relationship with user interest in the curation network. In the present invention the mean pooling of word vectors is used as the text representation: a text T is finally represented as

$T = \frac{1}{M_T} \sum_{i=1}^{M_T} \mathbf{w}_i$

where $M_T$ is the number of words in the text and $\mathbf{w}_i$ is the vector representation of word $\mathrm{Word}_i$.
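The mean pooling just described can be sketched as follows (a minimal version; returning a 300-dimensional zero vector for an empty description is our own assumption, matching the word-vector dimension used later):

```python
import numpy as np

def mean_pool(word_vectors):
    """Text representation: element-wise mean of the description's word vectors."""
    vecs = np.asarray(word_vectors, dtype=float)
    if vecs.size == 0:  # empty description -> zero vector (assumed fallback)
        return np.zeros(300)
    return vecs.mean(axis=0)
```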
4) Multimodal fusion
The present invention uses a multimodal DBM to fuse the picture and text information of a pin; its structure is shown in Figure 3. A multimodal DBM is built by adding a shared hidden layer on top of two two-layer DBMs; apart from the two visible layers, all hidden layers consist of binary units. Each DBM can be viewed as two stacked, connected restricted Boltzmann machines (RBM, restricted Boltzmann machine). An RBM is an undirected bipartite graph model: there are no connections within the visible layer or within the hidden layer, while the two layers are fully connected to each other. The RBM is an energy-based model. Given the visible layer $V=(v_i)\in\{0,1\}^D$, where $v_i$ is the i-th visible unit and D is the total number of visible units, and the hidden layer $H=(h_j)\in\{0,1\}^F$, where $h_j$ is the j-th hidden unit and F is the total number of hidden units, the two layers jointly define the energy function

$E(V,H;\theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} w_{ij} v_i h_j - \sum_{i=1}^{D} a_i v_i - \sum_{j=1}^{F} b_j h_j$

where $\theta=\{W,a,b\}$ are the model parameters, $w_{ij}\in\mathbb{R}$ ($\mathbb{R}$ being the set of real numbers) is the symmetric interaction term between the i-th visible unit and the j-th hidden unit, and $a_i$, $b_j$ are the bias terms of the i-th visible unit and the j-th hidden unit respectively. The joint distribution of the two layers obeys the Boltzmann distribution and is defined as

$P(V,H;\theta) = \frac{1}{Z(\theta)} \exp\left(-E(V,H;\theta)\right)$

where the latter factor is the potential function, $\exp(x)=e^x$ is the exponential with the natural constant as base, and $Z(\theta)$ is the partition function, also called the normalization constant, obtained by summing over all states of the two layers:

$Z(\theta) = \sum_{V}\sum_{H} \exp\left(-E(V,H;\theta)\right)$

Solving the joint distribution is in fact equivalent to computing a softmax. Since the units within a layer are conditionally independent, the conditional distributions can be derived from the joint distribution and factorized, giving the activation probability of the hidden layer

$P(h_j = 1 \mid V) = \mathrm{sigmoid}\left(\sum_{i=1}^{D} w_{ij} v_i + b_j\right)$

where $\mathrm{sigmoid}(x)=1/(1+e^{-x})$ is the sigmoid function; note that this activation probability has exactly the form of a neural-network neuron with a sigmoid activation function. The activation probability of the visible layer is solved in the same way and has the same form. The parameter optimization objective of the RBM is to maximize the log-likelihood function, which essentially maximizes the probability of the current input distribution (often with a regularization term added); the derivative of the log-likelihood is the difference between the data-dependent expectation and the model expectation, which can be interpreted as minimizing the error between the visible-layer activation probabilities and the input. Since the multimodal representations of the present invention are real-valued vectors, the bottom layer must be replaced with a variant of the RBM, the Gaussian-Bernoulli RBM. With the visible layer $V=(v_i)\in\mathbb{R}^D$ ($v_i$, D defined as above) and the hidden layer $H=(h_j)\in\{0,1\}^F$ (j, F defined as above), its energy function is

$E(V,H;\theta) = \sum_{i=1}^{D}\frac{(v_i-a_i)^2}{2\sigma_i^2} - \sum_{i=1}^{D}\sum_{j=1}^{F} w_{ij}\frac{v_i}{\sigma_i}h_j - \sum_{j=1}^{F} b_j h_j$

where $\theta=\{W,a,b,\sigma\}$ are the model parameters ($w_{ij}$, $a_i$, $b_j$ defined as above) and $\sigma_i$ is the standard deviation of the i-th visible unit. Its joint distribution is defined as before, while the partition function becomes an integral over the visible states:

$Z(\theta) = \int_V \sum_{H} \exp\left(-E(V,H;\theta)\right)\,\mathrm{d}V$

The activation probability of the hidden layer is solved as before (with each $v_i$ scaled by $1/\sigma_i$), while the conditional distribution of the visible layer becomes

$p(v_i \mid H) = \mathcal{N}\left(a_i + \sigma_i \sum_{j=1}^{F} w_{ij} h_j,\; \sigma_i^2\right)$

i.e. a normal distribution with mean $a_i + \sigma_i \sum_j w_{ij} h_j$ and variance $\sigma_i^2$, where $\sigma_i$ is the standard deviation of the i-th visible unit, $w_{ij}$ is the symmetric interaction term between the i-th visible unit and the j-th hidden unit, $h_j$ is the j-th hidden unit, and $a_i$ is the bias term of the i-th visible unit. A DBM formed by stacking RBMs can extract higher-level features; the definition of the DBM's energy function and joint distribution, the solution of its activation probabilities and its optimization objective are all similar to the RBM's, except that each intermediate layer behaves like a fully connected layer jointly activated by its two neighboring layers, as shown in Figure 3. Finally, the joint distribution of the multimodal DBM of the present invention is

$P(V_I,V_T;\theta) = \sum_{H_{I2},H_{T2},H_3} P(H_{I2},H_{T2},H_3)\left(\sum_{H_{I1}} P(V_I,H_{I1}\mid H_{I2})\right)\left(\sum_{H_{T1}} P(V_T,H_{T1}\mid H_{T2})\right) \qquad (10)$

where θ denotes all the model parameters, including the symmetric interaction terms between layers, the bias terms of every layer and the variances of the visible layers. $V_I$, $H_{I1}$, $H_{I2}$ are respectively the visible layer, first hidden layer and second hidden layer of the image pathway; $V_T$, $H_{T1}$, $H_{T2}$ are respectively the visible layer, first hidden layer and second hidden layer of the text pathway; $H_3$ is the top hidden layer. The overall structure is shown in Figure 3.
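The two conditional activation rules of the (Gaussian-)Bernoulli RBM can be written down directly from the energy functions above (a numerical sketch with our own function names; it computes single-layer conditionals only, not full DBM inference):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_activation(v, W, b, sigma=None):
    """P(h_j = 1 | V): sigmoid of the weighted visible input plus bias.
    For the Gaussian-Bernoulli bottom layer each v_i is first scaled by 1/sigma_i."""
    v = np.asarray(v, dtype=float)
    if sigma is not None:
        v = v / np.asarray(sigma, dtype=float)
    return sigmoid(v @ W + b)  # W has shape (D, F)

def visible_mean(h, W, a, sigma):
    """Mean of the Gaussian conditional p(v_i | H): a_i + sigma_i * sum_j w_ij h_j."""
    h = np.asarray(h, dtype=float)
    return np.asarray(a, dtype=float) + np.asarray(sigma, dtype=float) * (W @ h)
```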
Brief description of the drawings
Fig. 1 is a schematic diagram of the multi-level user relationships in a social curation network;
Fig. 2 is the block diagram of the method designed by the present invention;
Fig. 3 is a schematic diagram of the structure of the multimodal DBM used in the present invention;
Fig. 4 is the structure chart of the CNN used by the present invention to extract the picture representation;
Fig. 5 compares the results when the method of the invention is used for pin category prediction.
Specific embodiments
The object of the present invention is to provide a pin (Pin) representation method based on multimodal data, whose framework is shown in Figure 1. The present invention is described in further detail below with reference to the accompanying drawings and a concrete example.
The implementation steps of the invention are as follows:
1) Image representation
The picture is the core content of a pin and the most important carrier of user interest in a social curation network, so a good image representation is highly beneficial for user analysis there. The picture representation should not only contain the visual information of the picture itself, but should also bear some relationship to user interest in the curation network. Weighing efficiency against performance, the present invention chooses AlexNet trained on ImageNet as the base model for image representation learning.
The key to fine-tuning AlexNet is building a large-scale, high-quality dataset. The present invention represents the repinning relations of one and the same picture on the curation network as a tree structure, called the repin tree. For each picture, the frequency with which all users on its repin tree file it into the different categories is counted, and the vector of all category frequencies is used as a multidimensional real-valued label. Huaban has 33 categories in total; the frequency with which a picture is filed into the i-th category $C_i$ is denoted $f_i^C$, and the 33 category frequencies together form the picture's label.
AlexNet was originally designed for the 1000-class classification of mutually exclusive objects, with a softmax log-loss layer. Since both the nature of the labels and the learning objective differ from the original model, the loss layer is replaced with a sigmoid cross-entropy loss layer, where $\hat{f}_i^C$ is the sigmoid output corresponding to the category frequency $f_i^C$. The detailed structure of the CNN model of the present invention is shown in Fig. 4, in which ReLU is the rectified linear function, also called rectified linear unit, and LRN is local response normalization (local response normalization). The CNN model of the present invention is essentially a multi-label regressor. The model parameters after fine-tuning are saved for feature extraction, and the activation of the second fully connected layer is extracted as the image representation.
Since AlexNet requires a fixed input size, the present invention performs image scaling and image cropping as preprocessing. A picture is first scaled so that its shorter side is 256 pixels long, and then a 256 × 256-pixel patch is cropped around the center of the scaled picture. The dataset is not augmented with mirroring, rotation or color transformations; instead, it is expanded during fine-tuning by randomly cropping 227 × 227-pixel patches from the pictures. The mean image is also subtracted before a picture enters the convolutional layers, to speed up the convergence of the loss function.
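The two cropping steps can be sketched as follows (a minimal version operating on H × W × C numpy arrays; the shorter-side resize itself would be done with an image library and is omitted here):

```python
import numpy as np

def center_crop(img, size=256):
    """Crop a size x size patch around the centre of an H x W x C image array."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def random_crop(img, size=227, rng=None):
    """Random size x size crop used to expand the fine-tuning set."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = img.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return img[top:top + size, left:left + size]
```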
A small change is made to the network according to the actual pin categories on Huaban: the original 1000 final classes are changed to 33 classes. The model is run with the deep learning framework Caffe, using an NVIDIA GPU and CUDA for parallel acceleration. The model parameters actually loaded by the present invention are those of CaffeNet trained on ImageNet; this model differs technically from AlexNet in two points: first, it does not use principal component analysis (PCA, principal component analysis) for data augmentation; second, it swaps the order of the pooling and normalization layers. The output size of fully connected layer 8 of the model is set to 33. The structure, shown in Figure 4, consists of 5 convolutional layers and 3 fully connected layers, with some 60 million parameters and 650,000 neurons in total. CaffeNet has eight layers, structured as follows:
Convolutional layer 1: input 227×227×3, kernel 11×11×3, stride 4, activation ReLU
Convolutional layer 2: input 27×27×96, kernel 5×5×48, stride 1, padding 2, activation ReLU
Convolutional layer 3: input 13×13×256, kernel 3×3×256, stride 1, padding 1, activation ReLU
Convolutional layer 4: input 13×13×384, kernel 3×3×192, stride 1, padding 1, activation ReLU
Convolutional layer 5: input 13×13×384, kernel 3×3×192, stride 1, padding 1, activation ReLU
Fully connected layer 1: input 6×6×256, output 4096, dropout 0.5, activation ReLU
Fully connected layer 2: input 4096, output 4096, dropout 0.5, activation ReLU
Fully connected layer 3: input 4096, output 33
Since Caffe's generic data layer supports neither multi-label nor real-valued label input, the present invention implements a Python layer to read the labels. After fine-tuning completes, the activation of the second fully connected layer (layer 7 of the network) is extracted as the image representation.
2) Text representation
The description is an important complement to the picture in a pin; among the different pins on a repin tree, the description is one of the main contents distinguishing their users' preferences, so a good text representation matters greatly for user analysis, and especially personalization analysis, in a social curation network. Like the image representation, the text representation also implicitly encodes a relationship with user interest in the curation network. In the present invention the mean pooling of word vectors is used as the text representation: a text T is finally represented as $T = \frac{1}{M_T}\sum_{i=1}^{M_T}\mathbf{w}_i$, where $\mathbf{w}_i$ is the vector representation of word $\mathrm{Word}_i$ and $M_T$ is the number of words in the text.
The training corpora are public datasets: the enwiki and zhwiki Wikipedia dumps, the whole-web news data of Sogou Labs, and the Sohu news data. Both the public corpora and the pin descriptions are natural language and require extensive text preprocessing before they can be used for machine learning. The preprocessing carried out by the present invention comprises traditional-to-simplified conversion, punctuation removal, word segmentation, stop-word removal and machine translation. The concrete procedure is:
1. To prevent the machine from treating the traditional and simplified forms of the same Chinese text as different texts, traditional Chinese is converted to simplified Chinese using Open Chinese Convert (OpenCC, Open Chinese Convert), the langconv package and the zhconv package respectively; after comparison, the OpenCC conversion results were chosen;
2. Considering that punctuation marks carry almost no semantics, or semantics that are hard to learn, punctuation tables were built from the Unicode code tables with the string and zhon packages, and regular expressions were written to filter out both full-width and half-width punctuation using the re or regex package;
3. To segment the Chinese text into words for word2vec training, Chinese word segmentation was performed with the accurate mode of the jieba package, the THULAC package and the PyNLPIR package; after comparison, the segmentation results of the jieba package were chosen;
4. Considering that the semantics of certain high-frequency words and personal names are hard to learn, Chinese and English stop-word lists were built from such words and names, and the words on the lists were filtered out one by one;
5. Considering that natural language includes languages other than Chinese and English, requests simulating manual use of online translation services (Google Translate, Baidu Translate and Youdao Translate) were sent with the requests or urllib2 package, and the translations were extracted from the responses; after comparison, the Google Translate results were chosen.
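The chain above could be sketched as follows (a minimal stand-in: the stop-word list is purely illustrative, the `segment` parameter stands in for jieba's accurate-mode tokeniser, and the OpenCC conversion and machine-translation steps are omitted):

```python
import re

STOPWORDS = {"的", "了", "the", "a"}  # illustrative only, not the real lists

def preprocess(text, segment=None):
    """Strip punctuation, segment into tokens, drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text)  # remove full- and half-width punctuation
    tokens = segment(text) if segment else text.split()
    return [t for t in tokens if t and t not in STOPWORDS]
```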
The topic-modeling toolkit gensim provides a module that trains word2vec quickly. Through its parameter settings, a CBOW model was trained on the public corpora; the word-vector dimension was set to 300, which is usually the optimal dimension for word2vec; words with frequency below 5 were ignored during training; negative sampling was applied; and multiple CPUs were used to speed up training.
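Under the settings just described, the gensim call might look like this (a configuration sketch under the gensim 4.x API; `corpus` stands for the tokenised sentences of the combined corpora, and the exact worker count is our assumption):

```python
from gensim.models import Word2Vec

def train_word2vec(corpus):
    # CBOW (sg=0), 300-dimensional vectors, ignore words seen fewer than
    # 5 times, negative sampling, multiple workers -- as described above.
    return Word2Vec(
        sentences=corpus,
        vector_size=300,
        sg=0,
        min_count=5,
        negative=5,
        workers=4,
    )
```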
Before the description's word vectors are mean-pooled, PCA is first applied to the full set of word vectors produced by the model, to improve the performance of the pooled text vectors. The dimension of the text vector is the same as that of the word vectors.
3) Multimodal fusion
The present invention uses a multimodal DBM to fuse the picture and text information of a pin; its structure is shown in Figure 3. A multimodal DBM is built by adding a shared hidden layer on top of two two-layer DBMs; apart from the two visible layers, all hidden layers consist of binary units. Each DBM can be viewed as two stacked, connected restricted Boltzmann machines (RBM, restricted Boltzmann machine). An RBM is an undirected bipartite graph model: there are no connections within the visible layer or within the hidden layer, while the two layers are fully connected to each other. The RBM is an energy-based model. Given the visible layer $V=(v_i)\in\{0,1\}^D$, where $v_i$ is the i-th visible unit and D is the total number of visible units, and the hidden layer $H=(h_j)\in\{0,1\}^F$, where $h_j$ is the j-th hidden unit and F is the total number of hidden units, the two layers jointly define the energy function

$E(V,H;\theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} w_{ij} v_i h_j - \sum_{i=1}^{D} a_i v_i - \sum_{j=1}^{F} b_j h_j$

where $\theta=\{W,a,b\}$ are the model parameters, $w_{ij}\in\mathbb{R}$ is the symmetric interaction term between the i-th visible unit and the j-th hidden unit, and $a_i$, $b_j$ are the bias terms of the i-th visible unit and the j-th hidden unit respectively. The joint distribution of the two layers obeys the Boltzmann distribution and is defined as

$P(V,H;\theta) = \frac{1}{Z(\theta)} \exp\left(-E(V,H;\theta)\right)$

where the latter factor is the potential function, $\exp(x)=e^x$ is the exponential with the natural constant as base, and $Z(\theta)$ is the partition function, also called the normalization constant, obtained by summing over all states of the two layers:

$Z(\theta) = \sum_{V}\sum_{H} \exp\left(-E(V,H;\theta)\right)$

Solving the joint distribution is in fact equivalent to computing a softmax. Since the units within a layer are conditionally independent, the conditional distributions can be derived from the joint distribution and factorized, giving the activation probability of the hidden layer

$P(h_j = 1 \mid V) = \mathrm{sigmoid}\left(\sum_{i=1}^{D} w_{ij} v_i + b_j\right)$

where $\mathrm{sigmoid}(x)=1/(1+e^{-x})$ is the sigmoid function; note that this activation probability has exactly the form of a neural-network neuron with a sigmoid activation function. The activation probability of the visible layer is solved in the same way and has the same form. The parameter optimization objective of the RBM is to maximize the log-likelihood function, which essentially maximizes the probability of the current input distribution (often with a regularization term added); the derivative of the log-likelihood is the difference between the data-dependent expectation and the model expectation, which can be interpreted as minimizing the error between the visible-layer activation probabilities and the input. Since the multimodal representations of the present invention are real-valued vectors, the bottom layer must be replaced with a variant of the RBM, the Gaussian-Bernoulli RBM. With the visible layer $V=(v_i)\in\mathbb{R}^D$ ($v_i$, D defined as above) and the hidden layer $H=(h_j)\in\{0,1\}^F$ (j, F defined as above), its energy function is

$E(V,H;\theta) = \sum_{i=1}^{D}\frac{(v_i-a_i)^2}{2\sigma_i^2} - \sum_{i=1}^{D}\sum_{j=1}^{F} w_{ij}\frac{v_i}{\sigma_i}h_j - \sum_{j=1}^{F} b_j h_j$

where $\theta=\{W,a,b,\sigma\}$ are the model parameters ($w_{ij}$, $a_i$, $b_j$ defined as above) and $\sigma_i$ is the standard deviation of the i-th visible unit. Its joint distribution is defined as before, while the partition function becomes an integral over the visible states:

$Z(\theta) = \int_V \sum_{H} \exp\left(-E(V,H;\theta)\right)\,\mathrm{d}V$

The activation probability of the hidden layer is solved as before (with each $v_i$ scaled by $1/\sigma_i$), while the conditional distribution of the visible layer becomes

$p(v_i \mid H) = \mathcal{N}\left(a_i + \sigma_i \sum_{j=1}^{F} w_{ij} h_j,\; \sigma_i^2\right)$

i.e. a normal distribution with mean $a_i + \sigma_i \sum_j w_{ij} h_j$ and variance $\sigma_i^2$, where $\sigma_i$ is the standard deviation of the i-th visible unit, $w_{ij}$ is the symmetric interaction term between the i-th visible unit and the j-th hidden unit, $h_j$ is the j-th hidden unit, and $a_i$ is the bias term of the i-th visible unit. A DBM formed by stacking RBMs can extract higher-level features; the definition of the DBM's energy function and joint distribution, the solution of its activation probabilities and its optimization objective are all similar to the RBM's, except that each intermediate layer behaves like a fully connected layer jointly activated by its two neighboring layers, as shown in Figure 3. Finally, the joint distribution of the multimodal DBM of the present invention is

$P(V_I,V_T;\theta) = \sum_{H_{I2},H_{T2},H_3} P(H_{I2},H_{T2},H_3)\left(\sum_{H_{I1}} P(V_I,H_{I1}\mid H_{I2})\right)\left(\sum_{H_{T1}} P(V_T,H_{T1}\mid H_{T2})\right) \qquad (10)$

where θ denotes all the model parameters, including the symmetric interaction terms between layers, the bias terms of every layer and the variances of the visible layers. $V_I$, $H_{I1}$, $H_{I2}$ are respectively the visible layer, first hidden layer and second hidden layer of the image pathway; $V_T$, $H_{T1}$, $H_{T2}$ are respectively the visible layer, first hidden layer and second hidden layer of the text pathway; $H_3$ is the top hidden layer. The overall structure is shown in Figure 3.
In the real dataset, a pin on the social curation network may carry no description, so the multimodal fusion model must be able to handle missing single-modality data. To address this, after model training completes, the present invention uses a standard Gibbs sampler (Gibbs sampler) with alternating sampling to generate the missing text representations. Likewise, the top layer is inferred with the Gibbs sampler, and the activation probability of the top layer $H_3$ serves as the final multimodal joint representation of the present invention.
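The alternating-sampling idea can be illustrated on a toy model with one shared hidden layer (a simplified stand-in for inference in the full multimodal DBM; all names are ours, the observed modality is clamped, and the missing units are taken as Bernoulli for simplicity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_fill_missing(v_obs, W_obs, W_miss, b_h, b_miss, steps=50, rng=None):
    """Clamp the observed modality; alternate h ~ P(h | v_obs, v_miss) and
    v_miss ~ P(v_miss | h) to infer the missing modality and the shared layer."""
    if rng is None:
        rng = np.random.default_rng(0)
    v_miss = rng.random(W_miss.shape[0])  # random initialisation of missing units
    p_h = None
    for _ in range(steps):
        p_h = sigmoid(v_obs @ W_obs + v_miss @ W_miss + b_h)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(W_miss @ h + b_miss)
        v_miss = (rng.random(p_v.shape) < p_v).astype(float)
    return p_h, v_miss
```

After the chain settles, `p_h` plays the role of the top-layer activation probability used as the joint representation.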
Training of the multimodal DBM is accelerated with an NVIDIA GPU and CUDA. The dimensions of the hidden layers H_I1, H_T1, H_T2 are set to 4096, 300, 300 respectively, consistent with the input dimension of each modality, to preserve feature performance as far as possible; the dimensions of H_I2 and H_3 are both set to 2048 to compress the features. All image and text representations are fed into the model in pairs and pretrained layer by layer with the contrastive divergence algorithm; afterwards, the Gibbs sampler infers the missing text representations, and the activation probability of H_3 is extracted as the final multimodal joint representation of each pin.
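The layer-wise contrastive divergence pretraining mentioned above can be sketched as a CD-1 update for one binary RBM: positive-phase statistics from the data minus negative-phase statistics after one reconstruction step. Learning rate, sizes, and data are toy values, not the invention's configuration:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D, F, lr = 8, 5, 0.05
W = rng.normal(0, 0.01, (D, F))
a, b = np.zeros(D), np.zeros(F)

def cd1_step(v0):
    """One CD-1 parameter update; returns the reconstruction error."""
    global W, a, b
    p_h0 = sigmoid(v0 @ W + b)                        # positive phase
    h0 = (rng.random(F) < p_h0).astype(float)
    p_v1 = sigmoid(W @ h0 + a)                        # one reconstruction step
    v1 = (rng.random(D) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b)                        # negative phase
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a += lr * (v0 - v1)
    b += lr * (p_h0 - p_h1)
    return np.mean((v0 - p_v1) ** 2)

batch = (rng.random((100, D)) < 0.3).astype(float)    # synthetic binary data
errs = [np.mean([cd1_step(v) for v in batch]) for _ in range(5)]
```

Stacking then repeats this for each layer pair, feeding the previous layer's activation probabilities upward as data.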
4. Evaluation of recommendation results
The quality of a pin representation depends mainly on its ability to predict the user interest that the pin embodies on the social curation network. The category-frequency representation of a pin is in effect a summary by many users of their interest in the picture, with the description reinforcing the strength of certain categories, so it can be regarded as an approximation of the pin's interest distribution. Using it as the label, the present invention trains and validates a multidimensional logistic regression (LR) on the pins' text representations and multimodal fusion representations, then predicts the category distribution of the test set with the multidimensional LR.
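The evaluation probe described above, a multidimensional logistic regression fit on pin representations against category-frequency labels, can be sketched as a sigmoid regression trained by gradient descent on synthetic stand-in data. Only the 33-category count comes from the patent; everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n, d, C = 200, 32, 33                      # samples, feature dim, 33 categories
X = rng.normal(0, 1, (n, d))               # stand-in for joint representations
F = rng.dirichlet(np.ones(C), n)           # stand-in category-frequency labels

W = np.zeros((d, C)); b = np.zeros(C)
for _ in range(300):                       # gradient descent on cross-entropy
    Y = sigmoid(X @ W + b)
    W -= 0.5 * (X.T @ (Y - F) / n)         # d(loss)/dW for sigmoid cross-entropy
    b -= 0.5 * (Y - F).mean(axis=0)

pred = sigmoid(X @ W + b)
main_acc = np.mean(pred.argmax(1) == F.argmax(1))  # main-category accuracy
```

With real representations, `main_acc` on the test split is the headline number compared in Fig. 5.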
Fig. 5 compares the pin-category prediction results. The image representation of the present invention greatly improves the accuracy of main-category prediction, which indicates that treating the frequency distribution as a real-valued multi-label eliminates the interference between labels during model learning and gives the model more information to learn from.
On the other hand, the average non-zero-position error of the text-representation prediction is larger. Besides possible limitations of the model itself, this is likely because the description contains only the pin owner's personal preference for the picture, and this personal preference is only one part of the interest the picture can reflect. Its predictive ability on the main category is comparable to that of an AlexNet fine-tuned on the main category, which indirectly shows that summarizing a picture with a single category provides no more interest information than the description does.
Although the predictive performance of the text representation falls well short of that of the image representation, the information it contains is an effective supplement to the image representation, so that the multimodal fusion representation outperforms the image representation in prediction while also reducing dimensionality.
In summary, the practical text representation of the present invention reflects, to a certain extent, the user interest a pin carries, while the image representation and the multimodal fusion representation reflect the pin's interest distribution more comprehensively. The multimodal joint representation has a clear advantage in interest analysis and in data dimensionality.

Claims (5)

1. A multimodal representation method for pins (Pin) on a social curation network, characterized in that it comprises the following steps:
For a given pin, the image is preprocessed by scaling and cropping, then input into a convolutional neural network (CNN) trained on an automatically annotated image data set; after the forward propagation of the CNN completes, the activations of fully connected layer 2 are extracted as the image representation; each word in the pin description is mapped to a word vector by a word2vec model trained on a corpus, and all word vectors are pooled to obtain the text representation; the image representation and text representation are input together into a trained multimodal deep Boltzmann machine (DBM), and the inferred top-layer activation probability serves as the multimodal joint representation of the pin;
The CNN comprises 5 convolutional layers and 3 fully connected layers, structured as follows:
Convolutional layer 1: input 227*227*3, kernel 11*11*3
Convolutional layer 2: input 27*27*96, kernel 5*5*48
Convolutional layer 3: input 13*13*256, kernel 3*3*256
Convolutional layer 4: input 13*13*384, kernel 3*3*192
Convolutional layer 5: input 13*13*384, kernel 3*3*192
Fully connected layer 1: input 6*6*256, output 4096
Fully connected layer 2: input 4096, output 4096
Fully connected layer 3: input 4096, output 33.
2. The method according to claim 1, characterized in that the multimodal joint representation is obtained as follows:
The multimodal DBM is structured as a shared hidden layer added on top of two two-layer DBMs; apart from the two visible layers, all hidden layers are composed of binary units; each DBM can be regarded as two stacked restricted Boltzmann machine (RBM) layers;
Given a visible layer V = (v_i) ∈ {0,1}^D, where v_i is the i-th visible unit and D is the number of visible units in the layer, and a hidden layer H = (h_j) ∈ {0,1}^F, where h_j is the j-th hidden unit and F is the number of hidden units in the layer, the visible and hidden layers jointly define the energy function of the RBM

E(V, H; θ) = −Σ_i Σ_j w_ij v_i h_j − Σ_i a_i v_i − Σ_j b_j h_j

where θ = {W, a, b} are the model parameters, W ∈ ℝ^{D×F} with ℝ the set of real numbers, w_ij is the symmetric interaction term between the i-th visible unit and the j-th hidden unit, and a_i, b_j are the bias terms of the i-th visible unit and the j-th hidden unit respectively; the joint distribution of the two layers obeys the Boltzmann distribution, defined as

P(V, H; θ) = (1/Z(θ)) exp(−E(V, H; θ))

where the latter factor is the potential function, exp(x) = e^x is the exponential function with the natural constant as its base, and Z(θ) is the partition function, also called the normalization constant, computed over all states of the two layers:

Z(θ) = Σ_V Σ_H exp(−E(V, H; θ))

The joint distribution is thus in effect a softmax; since units within a layer are conditionally independent, the activation probability of the hidden layer is obtained from the joint distribution through the conditional distribution and factorization:

P(h_j = 1 | V) = sigmoid(Σ_i w_ij v_i + b_j)

where sigmoid(x) = 1/(1+e^(−x)) is the S-type (logistic) function; note that this activation-probability expression is identical to that of a neural-network neuron with a sigmoid activation function; the visible-layer activation probability is solved the same way and has the analogous expression; the parameter-optimization objective of the RBM is to maximize the log-likelihood, i.e. to maximize the probability of the current input distribution; the derivative of the log-likelihood is the difference between the data-dependent expectation and the model-distribution expectation, which can be interpreted as minimizing the error between the visible-layer activation probability and the input;
With a real-valued visible layer V = (v_i) ∈ ℝ^D and binary hidden layer H = (h_j) ∈ {0,1}^F, the energy function of the Gaussian-Bernoulli RBM is defined as

E(V, H; θ) = Σ_i (v_i − a_i)² / (2σ_i²) − Σ_i Σ_j (v_i/σ_i) w_ij h_j − Σ_j b_j h_j

where θ = {W, a, b, σ} are the model parameters and σ_i is the standard deviation of the i-th visible unit; the definition of the joint distribution is unchanged, while the partition function becomes

Z(θ) = ∫_V Σ_H exp(−E(V, H; θ)) dV

The hidden-layer activation probability is solved as before; the visible-layer activation probability becomes

p(v_i | H) = N(a_i + σ_i Σ_j w_ij h_j, σ_i²)

where N(μ, σ_i²) denotes a normal distribution with mean μ = a_i + σ_i Σ_j w_ij h_j and variance σ_i²; σ_i is the standard deviation of the i-th visible unit, w_ij the symmetric interaction term between the i-th visible unit and the j-th hidden unit, h_j the j-th hidden unit, and a_i the bias term of the i-th visible unit;
The joint distribution of the multimodal DBM is

P(V_I, V_T; θ) = Σ_{H_I2, H_T2, H_3} P(H_I2, H_T2, H_3) (Σ_{H_I1} P(V_I, H_I1 | H_I2)) (Σ_{H_T1} P(V_T, H_T1 | H_T2))

where θ is the full set of model parameters, including the symmetric interaction terms between layers, the variances of the visible layers, and the bias term of every layer; V_I, H_I1, H_I2 are the visible layer, first hidden layer, and second hidden layer of the image pathway; V_T, H_T1, H_T2 are the visible layer, first hidden layer, and second hidden layer of the text pathway; and H_3 is the top hidden layer.
3. The method according to claim 1, characterized in that, for missing single-modality data, alternating sampling with a standard Gibbs sampler generates the missing text representation; likewise, the top layer is inferred with the Gibbs sampler, and the activation probability of the top layer H_3 serves as the final multimodal joint representation.
4. The method according to claim 1, characterized in that the dimensions of the hidden layers H_I1, H_T1, H_T2 are set to 4096, 300, 300 respectively, consistent with the input dimension of each modality to preserve feature performance as far as possible, and the dimensions of H_I2 and H_3 are both set to 2048 to compress the features; all image and text representations are fed into the model in pairs and pretrained layer by layer with the contrastive divergence algorithm, after which the Gibbs sampler infers the missing text representations and the activation probability of H_3 is extracted as the final multimodal joint representation of the pin.
5. The method according to claim 1, characterized in that an AlexNet trained on ImageNet is chosen as the base model for image-representation learning;
On the social curation network, the repin relations of one picture are represented with a tree structure, called the repin tree; all users on a picture's repin tree are counted into different categories, and all the category frequencies are used as a multidimensional real-valued label; there are 33 categories in total, the classification frequency with which a picture is assigned to the i-th category C_i is denoted f_i, and the 33 classification frequencies together form the label of the picture;
The loss layer is replaced with a sigmoid cross-entropy loss layer, whose loss function is

L = −Σ_{n∈N_C} Σ_{i=1}^{33} [ f_i^(n) log y_i^(n) + (1 − f_i^(n)) log(1 − y_i^(n)) ]

where N_C is the sample set, f_i^(n) is the classification frequency of sample n for category i, and y_i^(n) is the corresponding sigmoid (S-type function) output.
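A minimal sketch of the sigmoid cross-entropy loss in claim 5, evaluated on one picture's 33 classification frequencies; inputs are toy values, and in the patent this layer replaces the CNN's original loss layer:

```python
import numpy as np

def sigmoid_cross_entropy(logits, freq):
    """Sigmoid cross-entropy between frequency labels and per-category logits."""
    y = 1.0 / (1.0 + np.exp(-logits))      # sigmoid output per category
    eps = 1e-12                            # numerical guard against log(0)
    return -np.mean(freq * np.log(y + eps) + (1 - freq) * np.log(1 - y + eps))

rng = np.random.default_rng(4)
freq = rng.dirichlet(np.ones(33))          # one picture's 33 category frequencies
logits = rng.normal(0, 1, 33)              # stand-in for fc3 outputs
loss = sigmoid_cross_entropy(logits, freq)
```

Because the labels are real-valued frequencies rather than one-hot classes, each of the 33 sigmoid outputs is regressed independently toward its frequency.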
CN201810505633.6A 2018-05-24 2018-05-24 Multimodal representation method for pins (Pin) on a social curation network Pending CN108876643A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810505633.6A CN108876643A (en) 2018-05-24 2018-05-24 Multimodal representation method for pins (Pin) on a social curation network


Publications (1)

Publication Number Publication Date
CN108876643A true CN108876643A (en) 2018-11-23

Family

ID=64333696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810505633.6A Pending CN108876643A (en) Multimodal representation method for pins (Pin) on a social curation network

Country Status (1)

Country Link
CN (1) CN108876643A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956142A (en) * 2019-12-03 2020-04-03 中国太平洋保险(集团)股份有限公司 Intelligent interactive training system
CN112396091A (en) * 2020-10-23 2021-02-23 西安电子科技大学 Social media image popularity prediction method, system, storage medium and application
CN113094534A (en) * 2021-04-09 2021-07-09 陕西师范大学 Multi-mode image-text recommendation method and device based on deep learning
CN114202038A (en) * 2022-02-16 2022-03-18 广州番禺职业技术学院 Crowdsourcing defect classification method based on DBM deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024056A (en) * 2010-12-15 2011-04-20 中国科学院自动化研究所 Computer aided newsmaker retrieval method based on multimedia analysis
CN105869058A (en) * 2016-04-21 2016-08-17 北京工业大学 Method for user portrait extraction based on multilayer latent variable model
CN106202413A (en) * 2016-07-11 2016-12-07 北京大学深圳研究生院 A cross-media retrieval method
CN106650756A (en) * 2016-12-28 2017-05-10 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image-text description method based on a knowledge-transfer multimodal recurrent neural network
CN106844442A (en) * 2016-12-16 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 Multimodal recurrent neural network image description method based on FCN feature extraction
CN107066583A (en) * 2017-04-14 2017-08-18 华侨大学 Image-text cross-modal sentiment classification method based on compact bilinear fusion
WO2017151757A1 (en) * 2016-03-01 2017-09-08 The United States Of America, As Represented By The Secretary, Department Of Health And Human Services Recurrent neural feedback model for automated image annotation



Similar Documents

Publication Publication Date Title
CN108804530B (en) Subtitling areas of an image
CN108876643A (en) Multimodal representation method for pins (Pin) on a social curation network
CN109783666A (en) Image scene graph generation method based on iterative refinement
CN104346440A (en) Neural-network-based cross-media Hash indexing method
CN110263325A (en) Chinese automatic word segmentation
CN114461804B (en) Text classification method, classifier and system based on key information and dynamic routing
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
Wu et al. Combining contextual information by self-attention mechanism in convolutional neural networks for text classification
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
CN112528136A (en) Viewpoint label generation method and device, electronic equipment and storage medium
CN111080551A (en) Multi-label image completion method based on depth convolution characteristics and semantic neighbor
Uddin et al. Depression analysis of bangla social media data using gated recurrent neural network
CN112732872A (en) Biomedical text-oriented multi-label classification method based on subject attention mechanism
CN116127099A (en) Combined text enhanced table entity and type annotation method based on graph rolling network
Nie et al. Cross-domain semantic transfer from large-scale social media
Omurca et al. A document image classification system fusing deep and machine learning models
CN115906835B (en) Chinese question text representation learning method based on clustering and contrast learning
Meng et al. Regional bullying text recognition based on two-branch parallel neural networks
CN114298011B (en) Neural network, training method, aspect emotion analysis method, device and storage medium
CN111768214A (en) Product attribute prediction method, system, device and storage medium
CN113076468B (en) Nested event extraction method based on field pre-training
Guo Deep learning for visual understanding
Li et al. Short text sentiment analysis based on convolutional neural network
Rifai et al. Arabic Multi-label Text Classification of News Articles
CN109062995A (en) Personalized recommendation algorithm for board (Board) covers on a social curation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181123