CN106776881A - A domain information recommendation system and method based on a microblog platform - Google Patents
- Publication number
- CN106776881A (application number CN201611075431.XA)
- Authority
- CN
- China
- Prior art keywords
- user
- keyword
- microblogging
- module
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Abstract
The invention discloses a domain information recommendation system and method based on a microblog platform, comprising: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module. The invention designs and implements a domain information recommendation method tailored to the characteristics of microblog platforms. It combines keyword extraction and keyword expansion seamlessly, so that domain features are extracted while the recommendation results remain dynamic. Experiments with a corresponding system on Sina Weibo demonstrate the effectiveness of the method. The invention can assist enterprise microblog marketing and effectively improve its efficiency.
Description
Technical field
The present invention relates to a domain microblog recommendation system and method for microblog platforms that supports unsupervised domain feature extraction and user-defined keywords, and belongs to the field of computer technology.
Background
As the Internet enters the post-Web 2.0 era, social functionality has become a model of major change on the Internet. Large social networking sites have sprung up and quickly taken a dominant position. As early as March 2010, the well-known American social networking site Facebook surpassed Google in traffic to become the most visited website in the United States. In China, emerging social media such as Sina Weibo have also risen rapidly: as of May 16, 2012, Sina Weibo had reached 300 million users, second in user scale in the domestic Internet market only to Tencent QQ, which had built its user base over more than ten years. In addition, in the report on the top ten strategic technologies for the IT industry in 2011 issued by the IT research and advisory firm Gartner, two technologies were directly related to social networking: Social Communications and Collaboration, and Social Analytics. Social was also the only technology category to occupy two places among all the technologies listed. This shows how much attention social products receive and how promising they are considered.
Meanwhile, the spread of Internet use and growing user stickiness have made big data a focus of the IT community in recent years. Converting big data into something of value, however, requires the support of data mining and related technologies, so interest in data mining and analysis has also risen steadily. This is especially true of enterprise-oriented data mining and analysis, because it can bring direct benefit to enterprises. Analysis results derived from the real data of a large number of users are more reliable and more convincing than those of traditional analysis techniques. Because such analysis runs in real time, it also adapts better to market changes and captures market opportunities better.
Although microblog platforms contain information on many domains and react quickly to domain events, obtaining comprehensive domain information from them still faces many difficulties. The rise of microblog platforms and the rapid growth of their user base have brought an information overload problem: as the number of accounts a user follows grows, content unrelated to the domain appears more and more often among the microblogs the user subscribes to. Conversely, if a user follows only a small number of highly domain-relevant accounts, the user cannot obtain comprehensive domain information in time. In other words, when a user acquires data on a microblog platform, precision and recall cannot both be achieved. Microblog platforms are also characterized by scattered topics and fragmented information. This calls for a method that can identify the domain interest of an enterprise microblog user and extract and recommend microblog information according to its domain relevance. However, existing social media management and analysis software mostly extracts domain information by simple text matching against user-defined keywords, an approach with serious defects. First, individual keywords cannot fully characterize a domain information need; second, the richness of language limits the effectiveness of simple string matching.
The traditional methods of modeling user interest by keyword extraction and of representing user interest by user-defined keywords each have their own defects.
The defects of keyword extraction are as follows. It is an unsupervised extraction algorithm whose results the user cannot adjust dynamically, so it cannot meet a user's temporary, changing need for a particular domain topic. In addition, the algorithm cannot model domain interest when the user has few historical microblogs, the so-called cold-start problem.
The defects of user-defined keywords are as follows. First, user-defined keywords can hardly be guaranteed to cover all information in the domain. Second, in short microblog texts, judging similarity only by whether a keyword appears means that many messages with the same meaning but different wording cannot be extracted.
Related research also has its own problems when applied to this scenario. The rise of social networks gives data mining and data analysis technologies a valuable new application scenario and a new perspective, but it also poses new challenges for traditional data mining technology.
In summary, applying the above methods in a real microblog platform system raises the following major problems:
(1) Keyword extraction and keyword expansion each have their own problems, and a single strategy cannot adequately meet the need for domain-related information recommendation on a microblog platform.
(2) In related research, because information on social platforms is fragmented, one class of algorithms must rely on external corpora for training. This greatly reduces the practicality and portability of the algorithms.
(3) Another class of algorithms builds its model on a global corpus; such algorithms are very expensive to compute and are not generally applicable.
A recommendation method is therefore needed that provides keyword-based, personalized recommendation of domain information and helps users obtain domain-related information quickly and accurately. The method must resolve the conflict between precision and recall that enterprise users face when obtaining domain-related information on a microblog platform. For practicality, the algorithm must also not depend on external corpora and must compute from the user's local data. Proposing a recommendation method with these properties is the focus of the present invention.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a domain information recommendation system and method based on a microblog platform. Working from the user's historical microblogs, it models user interest by combining keyword extraction with keyword expansion, which both ensures comprehensive identification of domain information and lets users adjust their domain interest dynamically as their needs change. It uses the graph-based keyword extraction algorithm TextRank, which does not rely on other corpora and keeps the extraction result from being affected by the Zipf's law phenomenon present in language models, and it proposes an optimized P-IOW algorithm for better keyword expansion. This ensures that the user's dynamic interest needs are met in real time and greatly enhances the expressive power of user-defined keywords. By linearly merging the results of keyword extraction and keyword expansion according to user-defined weights, the invention can recommend personalized domain microblog information and topics, helping users obtain domain-related information quickly and accurately.
Technical solution of the invention: a domain information recommendation system based on a microblog platform, comprising a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module, wherein:
Data acquisition and preprocessing module: obtains the user's relevant microblog data and preprocesses it. Preprocessing comprises stop-word filtering, word segmentation, and part-of-speech tagging. The preprocessing result, which is the user's historical microblog data, is passed to the domain keyword extraction module; if the user has defined domain interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module;
Domain keyword extraction module: based on the preprocessing result, keywords are extracted without supervision using the TextRank for Weibo algorithm, a modification of TextRank. The algorithm has two stages: construction of an undirected graph based on co-occurrence, and graph-based computation of node weights. In the construction stage, each segmented word that appears in the user's historical microblogs is first converted into a node. Edges are then built from co-occurrence within a single microblog: whether two nodes are connected, and the weight of the edge, are determined by how often the two words co-occur in the same microblog. Each time two words co-occur in one of the user's microblogs, the weight of the edge between their nodes is increased by 1, so the final edge weight is the co-occurrence count of the two words over the microblogs. In the node-weight stage, the weights are computed iteratively until the change in node weights falls below a threshold. When the iteration ends, the weight of each node is the importance of the word it represents. Sorting all of the user's words by importance gives the keyword extraction result, which automatically identifies the domain characteristics of the user;
User-defined keyword expansion module: computes the similarity between keywords from their co-occurrence, their distribution, and the attributes of the users they belong to, and takes highly relevant words as the expansion result of the target keyword. The module supports multiple user-defined keywords: the expansion word vectors obtained for the individual keywords are added linearly to give the final expansion vector. User-defined keyword expansion ensures that the user's dynamic interest needs are met in real time and greatly enhances the expressive power of user-defined keywords;
Linear merging module: after automatic domain keyword extraction and the expansion based on user-defined keywords are complete, the two result vectors are normalized with max normalization, which maps the keyword extraction and keyword expansion vectors into a unified value range. The two normalized vectors are then merged linearly; the user can define the weights given to keyword extraction and keyword expansion in the merge. The module outputs one word vector that represents the user's final domain interest;
Relevance calculation and personalized recommendation module: once the linear merging module has produced the keyword vector that characterizes the user's domain interest, each microblog to be filtered is segmented and its word frequencies are counted to generate a word frequency vector. The user interest keyword vector, the word frequency vector of the microblog to be recommended, and the IDF vector are then combined by a dot product to obtain the relevance of the microblog to the user's interest, which is the domain relevance of that microblog. The domain relevance of each of the user's microblogs is computed, the microblogs are sorted from high to low domain relevance, and the results are presented to the user, giving a personalized domain microblog recommendation;
Topic acquisition module: trains an LDA model on the domain microblog texts recommended to the user and clusters terms into topics according to each topic's term distribution. The relevance between each topic's term set and the user's domain interest keywords obtained from the linear merging module is then computed to give the topic's importance, and the topics are presented to the user in order of importance, completing topic discovery and recommendation.
The data acquisition and preprocessing module is implemented as follows:
(1) After the user logs in to the system, the user is first verified. Once verification passes, the system automatically uses the microblog platform credential associated with the user to interact with the microblog platform and verify the legitimacy of the user's identity on that platform;
(2) The user's relevant microblog data is obtained and persisted in structured form in a local database so that it can be read at any time;
(3) The persisted microblog text is preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging. Given the characteristics of microblog text, stop words are first filtered out by pattern matching. Chinese word segmentation and part-of-speech tagging optimized for the microblog scenario are then performed with the ICTCLAS 5.0 segmenter. Before keyword extraction and keyword expansion, the segmented microblog text is also filtered by part of speech so that only nouns are retained. The preprocessing result is the user's historical microblog data.
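The stop-word and part-of-speech filtering in step (3) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the regular expressions are assumptions covering the microblog stop-word forms the description names (topic tags, directed notifications, emoticons, URL links), segmentation itself is left to an external segmenter, and the convention that noun tags begin with "n" is likewise an assumption.

```python
import re

# Assumed patterns for the microblog stop-word forms; the exact
# expressions are illustrative, not taken from the patent.
STOP_PATTERNS = [
    re.compile(r"https?://\S+"),    # URL link
    re.compile(r"#[^#]+#"),         # topic tag: #topic words#
    re.compile(r"@[\w\-]+"),        # directed notification: @username
    re.compile(r"\[[^\[\]]+\]"),    # emoticon: [emoticon word]
]


def strip_stop_patterns(text: str) -> str:
    """Remove URLs, topic tags, mentions and emoticons, then tidy whitespace."""
    for pattern in STOP_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def keep_nouns(tagged_tokens):
    """Keep only words whose part-of-speech tag marks a noun.

    `tagged_tokens` is a list of (word, tag) pairs as a segmenter might
    return them; tags starting with "n" are assumed to denote nouns.
    """
    return [word for word, tag in tagged_tokens if tag.startswith("n")]
```

Filtering before segmentation keeps platform markup out of the vocabulary, so the later co-occurrence statistics are computed over content words only.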
In the domain keyword extraction module, the node weights are computed iteratively following the idea of the PageRank algorithm:
$$TR(V_i) = (1 - d) + d \sum_{V_j \in E(V_i)} \frac{w_{ji}}{\sum_{V_k \in E(V_j)} w_{jk}} \, TR(V_j)$$
where $V_i$ is the $i$-th node, $TR(V_i)$ is the weight of node $V_i$, $w_{ij}$ is the weight of the edge between nodes $V_i$ and $V_j$, and $E(V_i)$ is the set of edges connected to $V_i$ (the sums run over the nodes at the other end of those edges). $d$ is the damping coefficient of the iteration and is set to 0.85. The iteration can start from arbitrary initial values and continues until convergence; the convergence condition is that the sum over all nodes of the absolute difference in weight between the current and the previous iteration is less than a specified value.
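The two stages of the extraction algorithm can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the patent's TextRank for Weibo code: it assumes the standard weighted TextRank update, TR(Vi) = (1 - d) + d × Σ over neighbours Vj of (wji / total edge weight of Vj) × TR(Vj), which matches the symbols defined above, and it reads the convergence test as the summed absolute change over all nodes. The function names and the tolerance value are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(posts):
    """Build an undirected weighted graph from segmented posts.

    `posts` is a list of token lists. Each pair of distinct words that
    appears in the same post adds 1 to the weight of the edge between them.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for tokens in posts:
        for a, b in combinations(sorted(set(tokens)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def textrank(graph, d=0.85, tol=1e-6, max_iter=200):
    """Iterate the weighted TextRank update until the summed change is below tol."""
    scores = {node: 1.0 for node in graph}
    # Total edge weight of each node: the denominator of the update.
    strength = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node, nbrs in graph.items():
            rank = sum(w / strength[nbr] * scores[nbr] for nbr, w in nbrs.items())
            new_scores[node] = (1 - d) + d * rank
        change = sum(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if change < tol:
            break
    # Words sorted by importance, most important first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `textrank(build_cooccurrence_graph(posts))` on the noun lists produced by preprocessing yields the ranked keyword list. Words that never co-occur with another word get no node in this sketch.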
In the user-defined keyword expansion module, the similarity between keywords is computed with the improved P-IOW algorithm, as follows:
For a given user-defined keyword $s$, the domain relevance of a word $t$ with respect to $s$ is computed by formula (1), in which:
$s$ is the user-defined keyword; $wf(t)$ is the number of microblogs containing the word $t$; $t_M$ is the word contained in the largest number of microblogs in the domain-related corpus, and $wf(t_M)$ is the number of microblogs containing it; the number of microblogs that do not contain the word $s$ also enters the formula; $wf(t \wedge s)$ is the number of microblogs containing both $t$ and $s$; $N$ is the total number of the user's microblogs; $s_p$ is a smoothing factor with range $(0, 1)$ that prevents division by zero when a count in the denominator is zero. It follows from the formula that, before formula (1) is multiplied by the down-weighting factor, the value of P-IOLogW lies between $\log(s_p)$ and $\log(1/s_p)$. The larger the P-IOW value, the higher the domain similarity between the word $t$ and the user-defined keyword $s$. The user-defined keyword itself is assigned the upper bound of the P-IOLogW range, $\log(1/s_p)$.
In the user-defined keyword expansion module, the expansion word vectors obtained by keyword expansion are added linearly to give the final expansion vector. The expansion procedure is as follows:
(1) For each keyword in the user-defined keyword set, compute the weights of all of its expansion words with P-IOW;
(2) Map the expansion word weights of each keyword onto the word space of the domain-related corpus to form an expansion word vector;
(3) Linearly superpose all expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.
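The three expansion steps can be sketched as below. Because the P-IOW formula is not reproduced in this text, the sketch deliberately swaps in a simple conditional co-occurrence ratio as a stand-in score; `cooccurrence_score` and `expand_keywords` are hypothetical names, and only the mapping and superposition steps follow the description directly.

```python
from collections import Counter


def cooccurrence_score(posts, keyword):
    """Stand-in relatedness score, NOT the P-IOW formula.

    For every word t, returns the fraction of posts containing `keyword`
    that also contain t. The keyword itself receives the maximum score, 1.0.
    """
    with_kw = [set(p) for p in posts if keyword in p]
    if not with_kw:
        return {}
    counts = Counter(t for p in with_kw for t in p)
    return {t: c / len(with_kw) for t, c in counts.items()}


def expand_keywords(posts, keywords, score=cooccurrence_score):
    """Map each keyword's expansion weights onto the vocabulary and sum them."""
    vocabulary = sorted({t for p in posts for t in p})
    expansion = {t: 0.0 for t in vocabulary}     # step (2): corpus word space
    for kw in keywords:
        for t, weight in score(posts, kw).items():   # step (1): per-keyword weights
            expansion[t] += weight                   # step (3): linear superposition
    return expansion
```

Passing the scoring function as a parameter keeps the superposition logic independent of the particular relatedness measure, so a real P-IOW implementation could be dropped in without changing `expand_keywords`.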
The linear merging module is implemented as follows:
(1) The vectors produced by keyword extraction and keyword expansion are each normalized with max normalization;
(2) For each component of a vector, the normalization is:
$$v_{\text{normal}} = v / v_{\max}$$
where $v$ is the original value of a component of the vector and $v_{\max}$ is the largest of all its components. After max normalization, every component of the vector lies in $(0, 1]$ and is non-zero. The vectors are then merged by linear weighting:
$$V_{\text{combine}} = r \times V_{\text{kw-extract}} + (1 - r) \times V_{\text{kw-expand}}$$
where $r$ is the user-defined merging weight, $V_{\text{kw-extract}}$ is the result vector of keyword extraction, and $V_{\text{kw-expand}}$ is the result vector of keyword expansion. The merged result is the keyword vector $V_{\text{combine}}$ that characterizes the user's domain interest.
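A minimal sketch of the normalization and merge, assuming the two vectors are stored as dictionaries from word to weight and that a word missing from one vector contributes 0 for that vector. The function names are hypothetical.

```python
def max_normalize(vector):
    """Divide every component by the largest component (v / v_max)."""
    v_max = max(vector.values(), default=0.0)
    if v_max <= 0:
        return dict(vector)      # nothing to scale
    return {term: value / v_max for term, value in vector.items()}


def merge(extracted, expanded, r):
    """Return r * V_kw-extract + (1 - r) * V_kw-expand after max normalization."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    a, b = max_normalize(extracted), max_normalize(expanded)
    terms = set(a) | set(b)
    return {t: r * a.get(t, 0.0) + (1 - r) * b.get(t, 0.0) for t in terms}
```

With `r` close to 1 the profile follows the automatically extracted keywords; with `r` close to 0 it follows the user-defined keywords and their expansions.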
In the similarity calculation and personalized recommendation module, the domain relevance of a microblog $T$ to be recommended is the sum, over all dimensions, of the product of three quantities for each word $t_i$: the component of $V_{\text{combine}}$ for $t_i$, the component of $W$ for $t_i$, and $IDF(t_i)$. Here $V_{\text{combine}}$ is the keyword vector of the user's domain interest; $W$ is the word frequency vector of the microblog $T$; $V_{IDF}$ is the IDF vector of the segmented words; $L$ is the total number of dimensions of the vector space obtained by segmenting the user's historical microblogs; and $IDF(t_i)$ is the IDF value of $t_i$.
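The relevance score and the ranking step can be sketched as follows, reading the three-vector dot product as the sum over words of interest weight × term frequency × IDF. The smoothed IDF variant in `idf` is an assumption, since the text does not say how IDF values are obtained, and all function names are hypothetical.

```python
import math
from collections import Counter


def idf(posts):
    """Smoothed inverse document frequency over segmented posts (assumed variant)."""
    n = len(posts)
    df = Counter(t for p in posts for t in set(p))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def relevance(interest, tokens, idf_values):
    """Sum of interest weight * term frequency * IDF over the post's words."""
    tf = Counter(tokens)
    return sum(interest.get(t, 0.0) * f * idf_values.get(t, 0.0) for t, f in tf.items())


def recommend(interest, candidates, idf_values):
    """Rank candidate posts from most to least domain-relevant."""
    scored = [(relevance(interest, c, idf_values), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Words outside the interest vector contribute nothing, so a post unrelated to the user's domain scores 0 however long it is.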
A domain information recommendation method based on a microblog platform, divided into six steps: data acquisition and preprocessing, domain keyword extraction, user-defined keyword expansion, linear merging, similarity calculation and personalized recommendation, and topic acquisition. The method is implemented as follows:
(1) The user's relevant microblog information is obtained and preprocessed. Preprocessing consists of filtering stop words by pattern matching and then performing word segmentation and part-of-speech tagging with the ICTCLAS 5.0 word segmentation system. The preprocessing result is the user's historical microblog data.
(2) Domain keywords are extracted from the preprocessing result without supervision, using the TextRank for Weibo algorithm that the invention derives from TextRank. The process has two stages: construction of an undirected graph based on co-occurrence, and graph-based computation of node weights. In the construction stage, each segmented word that appears in the user's historical microblogs is first converted into a node; then, for every word pair that appears in the segmentation result of a microblog, an edge is built whose weight equals the pair's co-occurrence count in the historical microblogs;
In the node-weight stage, the weights are computed iteratively following the idea of the PageRank algorithm until the change in node weights falls below a threshold;
(3) If the user has defined domain interest keywords, the preprocessing result is also passed to the keyword expansion module. Keyword expansion uses the P-IOW algorithm, an improvement of the P-IOLog algorithm, which computes the similarity between keywords from information such as their co-occurrence, their distribution, and the attributes of the users they belong to, and takes highly relevant words as the expansion result of the target keyword. Multiple user-defined keywords are supported: for each keyword, the keyword expansion module adds the resulting expansion word vector linearly to the others to obtain the final expansion vector;
(4) The two results are merged according to user-defined weights. The two result vectors are first normalized with max normalization, which maps them into a unified value range. The two normalized vectors are then merged linearly, with user-defined weights for keyword extraction and keyword expansion. The merged result is used by the relevance recommendation module to compute a relevance score for each microblog to be recommended;
(5) Each microblog to be recommended from the user's subscriptions is segmented and vectorized by word frequency. Using the identified user interest, the domain relevance is then computed as a vector-space dot product of the user interest keyword vector, the word frequency vector of the microblog to be recommended, and the IDF vector;
(6) The set of domain microblog texts recommended to the user is taken as input, and terms are clustered with an LDA-based strategy to discover topics. The relevance between each topic's term set and the user's domain interest keywords obtained in the linear merging step is then computed to establish the topic's importance, and topics are recommended to the user in order of importance.
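The topic ranking in step (6) can be sketched as follows, assuming an LDA model has already been trained elsewhere and each topic is available as a dictionary from term to probability. The text does not give the relevance formula for topics, so the dot product used here is an assumption, and the function names are hypothetical.

```python
def topic_importance(topic_terms, interest):
    """Dot product of a topic's term distribution with the interest vector."""
    return sum(p * interest.get(term, 0.0) for term, p in topic_terms.items())


def rank_topics(topics, interest, top_n=5):
    """Return (importance, top terms) per topic, most important first."""
    ranked = []
    for terms in topics:
        # The most probable terms serve as a short label for the topic.
        top_terms = sorted(terms, key=terms.get, reverse=True)[:top_n]
        ranked.append((topic_importance(terms, interest), top_terms))
    return sorted(ranked, key=lambda pair: pair[0], reverse=True)
```

A topic whose probable terms carry no weight in the interest vector scores near 0 and falls to the end of the list, which is how topics unrelated to the domain would be pushed out of the recommendation.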
Compared with the prior art, the invention has the following advantages:
(1) Regarding the source of the domain corpus: keyword extraction and keyword expansion each have their own problems, and a single strategy cannot adequately meet the need for domain-related information recommendation on a microblog platform. The invention therefore models the user's domain interest by fusing graph-based keyword extraction with user-defined keyword expansion based on co-occurrence information over the domain corpus. This ensures that the user's dynamic interest needs are met in real time and greatly enhances the expressive power of user-defined keywords. It balances a comprehensive characterization of domain information against dynamic changes in user interest, and so resolves the conflict between precision and recall that domain users face when obtaining domain-related information on a microblog platform.
(2) For practicality, the algorithm of the invention does not depend on external corpora and does not need to analyze a large-scale corpus. It computes from the user's local data, namely the user's historical microblog data, so it responds quickly and is more practical and portable.
(3) Topic modeling is performed again on the domain microblog texts recommended to the user, which filters out interference from topics unrelated to the domain, gives more accurate topic recommendation, and further improves the accuracy of domain information recommendation.
Brief description of the drawings
Fig. 1 is the architecture diagram of the invention;
Fig. 2 is the overall system flow chart of the invention;
Fig. 3 is the system sequence diagram of the invention;
Fig. 4 is the system framework diagram of the invention;
Fig. 5 is an excerpt of the main part of the UML diagram of the graph-based keyword extraction submodule of the invention;
Fig. 6 is an excerpt of the main part of the UML diagram of the co-occurrence-based keyword expansion submodule of the invention.
Detailed description of the embodiments
The invention is described in detail below with reference to specific embodiments and the drawings.
As shown in Fig. 1, the domain information recommendation system based on a microblog platform of the invention comprises: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module.
Data acquisition and preprocessing module: for an enterprise microblog user on the Sina Weibo platform, the module analyzes the historical microblog text of relevant microblog accounts in the domains the user specifies. It obtains the user's relevant historical microblog information through the Sina Weibo open platform and preprocesses the data: stop words are filtered out by pattern matching, and word segmentation and part-of-speech tagging are performed with the ICTCLAS 5.0 word segmentation system. The preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module.
Domain keyword extraction module: based on the preprocessing result, an undirected graph based on co-occurrence is constructed first. Each segmented word that appears in the user's historical microblogs is converted into a node, and for every word pair that appears in the segmentation result of a microblog an edge is built whose weight equals the pair's co-occurrence count in the historical microblogs. The graph-based node weights are then computed iteratively, following the idea of the PageRank algorithm, until the change in node weights falls below a threshold. When the iteration ends, the weight of each node is the importance of the word it represents. Sorting all of the user's words by importance gives the keyword extraction result, which automatically identifies the domain characteristics of the user.
User-defined keyword expansion module: uses the improved P-IOW (Probabilistic Inside-Outside Log for Weibo) method to compute the similarity between keywords from information such as their co-occurrence, their distribution, and the attributes of the users they belong to, and takes highly relevant words as the expansion result of the target keyword. The module supports multiple user-defined keywords: for each keyword, the module adds the resulting expansion word vector linearly to the others to obtain the final expansion vector. User-defined keyword expansion ensures that the user's dynamic interest needs are met in real time and greatly enhances the expressive power of user-defined keywords.
Linear merging module: after automatic domain keyword extraction and the expansion based on user-defined keywords are complete, the two result vectors are normalized with max normalization, which maps them into a unified value range. The two normalized vectors are then merged linearly; the user can define the weights given to keyword extraction and keyword expansion in the merge. The module outputs one word vector that represents the user's final domain interest.
Relevance calculation and personalized recommendation module: once the keyword vector characterizing the user's domain interest has been generated, the module segments each microblog to be filtered and counts its word frequencies to generate a word frequency vector. It then combines the user interest keyword vector, the word frequency vector of the microblog to be recommended, and the IDF vector by a dot product to obtain the relevance of the microblog to the user's interest. The domain relevance of each of the user's microblogs is computed, and the microblogs are sorted from high to low domain relevance, giving a personalized domain microblog recommendation.
Topic acquisition module: trains an LDA model on the domain microblog texts recommended to the user and clusters terms into topics according to each topic's term distribution. The relevance between each topic's term set and the user's domain interest keywords obtained from the linear merging module is then computed to give the topic's importance, and the topics are presented to the user in order of importance, completing topic discovery and recommendation.
As shown in Fig. 2, the data acquisition and preprocessing module is implemented as follows:
(1) After the user logs in to the system, the user is first verified. Once verification passes, the system automatically uses the Sina Weibo open platform OAuth 2.0 credential associated with the user to interact with the open platform and verify the legitimacy of the user's identity on the Sina Weibo platform. If the credential has not expired, user verification is complete. If the credential does not exist or has expired, the system automatically redirects to the OAuth 2.0 verification page of the open platform, which asks the user to enter their Sina Weibo user name and password. After the user enters the correct information, the open platform returns the updated credential to the system. The system persists the credential so that, within its validity period, the user needs only the system's own user name and password to log in to the open platform, obtain microblog data, and perform related operations.
(2) The microblog text data obtained is persisted in structured form in a local database so that the upper-layer analysis modules can read it at any time. For data updates, this module supports incremental updating: each update transfers only the user's newly added microblog information, which improves the system's response time and minimizes network bandwidth use.
(3) The persisted microblog text is then preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging. Microblog stop words mainly take the following forms: topic tags of the form "#topic#", directed mentions of a user of the form "@username", emoticons of the form "[emoticon]", and URL links contained in the post. Given these characteristics of microblog text, the stop words are first filtered out by pattern matching. Chinese word segmentation and part-of-speech tagging, optimized for the microblog setting, are then performed. Word segmentation is a text-processing step specific to a minority of languages such as Chinese, because Chinese, unlike most other languages, has no explicit word separator. In natural language processing, segmentation is an unavoidable step in converting text into a form a computer can process. Because the present invention uses a vector space model, segmentation is all the more indispensable, and the quality of the segmentation directly affects the quality of the algorithm's results. Part-of-speech tagging assigns the correct part-of-speech tag to each word in a given sentence and is a very useful preprocessing step for subsequent natural language processing. This module uses the ICTCLAS5.0 segmenter for segmentation and part-of-speech tagging; because the imported new-word corpus also contains part-of-speech information, it does not affect tagging accuracy. A survey of domain-related microblogs on the Sina Weibo platform found that nouns characterize a user's domain interest accurately, whereas words of other parts of speech, when used as keywords, tend to introduce ambiguity and reduce recommendation accuracy. Therefore, before keyword extraction and keyword expansion, the segmented user microblogs are filtered by part of speech and only nouns are retained. The preprocessing result is the user's historical microblog data.
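The pattern-matching filter and the noun-only part-of-speech filter described in step (3) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the regular expressions, the function names, and the assumption that noun tags start with "n" (as in common Chinese tag sets) are all hypothetical choices made for the example, and the segmenter itself is assumed to have already produced (word, tag) pairs.

```python
import re

# Patterns for the four kinds of microblog noise named in the text.
# The exact expressions are assumptions made for this illustration.
URL = re.compile(r"https?://\S+")         # links embedded in the post
TOPIC_TAG = re.compile(r"#[^#]+#")        # "#topic#" tags
MENTION = re.compile(r"@[\w\-]+")         # "@username" mentions
EMOTICON = re.compile(r"\[[^\[\]]+\]")    # "[emoticon]" markers


def strip_weibo_noise(text):
    """Remove URLs, topic tags, mentions and emoticons, then tidy whitespace."""
    for pattern in (URL, TOPIC_TAG, MENTION, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def keep_nouns(tagged_tokens):
    """Keep only words whose part-of-speech tag starts with 'n' (assumed noun tags)."""
    return [word for word, tag in tagged_tokens if tag.startswith("n")]
```

Filtering URLs first keeps the other patterns from matching fragments inside a link. In a real pipeline the output of `strip_weibo_noise` would be passed to the segmenter, and `keep_nouns` would be applied to the segmenter's tagged output.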
As shown in Fig. 2, the domain keyword extraction module is implemented as follows:
(1) For keyword extraction, the present invention selected the comparatively advanced TextRank algorithm after experimental comparison, for the following reasons. TextRank overcomes shortcomings of the TF-IDF method: it does not need to compute TF information, so microblogs do not have to be merged, and it does not depend on external IDF information. TextRank's calculation embodies the assumption that words co-occurring with keywords of high domain relevance themselves have high domain relevance, so keyword weights are no longer computed linearly, which to some extent overcomes the power-law distribution problem of the TF-IDF algorithm. It is therefore better suited to this application scenario than other algorithms.
(2) This module draws on the idea of TextRank keyword extraction and, taking the characteristics of the microblog scenario into account, proposes the TextRank for Weibo algorithm. TextRank for Weibo is a graph-based keyword extraction algorithm inspired by PageRank. For graph construction, traditional TextRank defines the weight of the edge between two words by their number of co-occurrences within a fixed-length sliding window over a document. Given that microblogs are short texts, the present invention instead uses the number of times two words co-occur in the same microblog post as the weight of the edge between them. The PageRank algorithm is then run on this undirected graph to compute each word's weight as a keyword. In the present invention each node of the graph represents one segmented word. If two words co-occur in one of the user's posts, the weight of the edge between their nodes is increased by 1, so the final weight of an edge is the number of co-occurrences of its two words in the user's microblogs.
Once the undirected graph is determined, the weight of each node is produced iteratively using an algorithm similar to PageRank. The weight of node V_i is updated according to the following formula:
where V_i is the i-th node; TR(V_i) is the weight of node V_i; w_ij is the weight of the edge between nodes V_i and V_j; E(V_i) is the set of edges connected to V_i; and d is the damping coefficient of the iteration, set to 0.85. Iteration can start from arbitrary initial values and continues until convergence, which is reached when the sum of the absolute differences of the node weights between the current and the previous iteration is less than a specified value.
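The iteration can be sketched in a few lines. The update formula is not reproduced in this text, so the code below assumes the standard weighted TextRank update, TR(V_i) = (1 - d) + d × Σ_j [w_ji / Σ_k w_jk] × TR(V_j), which is consistent with the symbols defined above but may differ from the patent's exact formula. The function name, the tolerance, and the iteration cap are illustrative assumptions.

```python
def textrank_scores(graph, d=0.85, tol=1e-6, max_iter=200):
    """Iterate node weights on an undirected weighted graph.

    graph: {node: {neighbour: edge_weight}} with symmetric entries.
    Assumes the standard weighted TextRank update (see lead-in).
    """
    scores = {node: 1.0 for node in graph}  # arbitrary initial value
    # Total edge weight attached to each node, used to normalise contributions.
    total_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node, nbrs in graph.items():
            incoming = sum(
                w / total_weight[nbr] * scores[nbr]
                for nbr, w in nbrs.items()
                if total_weight[nbr] > 0
            )
            new_scores[node] = (1 - d) + d * incoming
        # Convergence test: sum of absolute per-node weight changes.
        delta = sum(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy graph: "phone" co-occurs twice with "camera" and once with "battery".
example = {
    "phone": {"camera": 2, "battery": 1},
    "camera": {"phone": 2},
    "battery": {"phone": 1},
}
ranking = sorted(textrank_scores(example).items(), key=lambda kv: kv[1], reverse=True)
```

Because d is below 1, each pass shrinks the remaining error, so the loop terminates well before the iteration cap on small graphs like this one.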
Fig. 5 shows the main part of the module's UML class diagram. The algorithm is divided into two phases: construction of the undirected graph based on co-occurrence relations, and graph-based calculation of node weights.
In the graph-construction phase, the segmented words that appear in the user's historical microblogs are first converted into corresponding nodes. Then, for the segmentation result of each post, an edge is constructed for every pair of words that occurs in it; the weight of an edge is the number of times that pair co-occurs in the historical microblogs.
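The graph-construction phase can be illustrated with a short sketch. It assumes each post has already been segmented and noun-filtered into a list of words, and that a pair of words is counted once per post no matter how often each word repeats in that post; the function name and the nested-dictionary representation are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(posts):
    """Build an undirected graph from per-post word lists.

    posts: list of word lists, one list per microblog post.
    Returns {word: {neighbour: number of posts in which both appear}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for words in posts:
        unique = sorted(set(words))
        for word in unique:
            _ = graph[word]  # touching the key registers the node
        for a, b in combinations(unique, 2):
            graph[a][b] += 1  # edge weight grows by 1 per co-occurring post
            graph[b][a] += 1
    return {node: dict(nbrs) for node, nbrs in graph.items()}


posts = [["phone", "camera", "battery"], ["phone", "camera"], ["battery"]]
graph = build_cooccurrence_graph(posts)
```

Words that never co-occur with another word still become nodes, with no edges attached.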
As shown in Fig. 2, the user-defined keyword expansion module is implemented as follows:
(1) Because interest modeling based on keyword extraction alone is insufficiently dynamic, the present invention introduces user-defined keywords to make interest modeling more dynamic. At the same time, to address the limited expressive power of user-defined keywords, the present invention proposes modeling user interest by combining keyword extraction with user-defined keyword expansion.
(2) The application scenario of the present invention requires that, against the background of a domain-related corpus, a number of user-defined domain-related keywords be expanded on the basis of domain relevance. The present invention therefore uses the co-occurrence information of words in the domain-related corpus to calculate the similarity between words. This approach rests on the assumption that words co-occurring with a user-defined keyword in the same microblog post have strong domain similarity to that keyword.
(3) The present invention adopts P-IOLog, an expansion method designed for topic tags. The confidence of the expansion-word weights that this method produces is proportional to the size of the sample space, that is, to the frequency of the user-defined keyword in the corpus under analysis; when few microblogs contain the user-defined keyword, the expansion results of the P-IOLog algorithm have a large error. The present invention therefore improves P-IOLog for this application scenario by introducing a down-weighting factor that increases as the number of occurrences of the user-defined keyword increases. It also removes the original method's treatment of the topic layer, yielding the Probabilistic Inside-Outside Log for Weibo (P-IOW) method.
(4) Specifically, for a given user-defined keyword s, the domain relevance of a word t with respect to s is calculated as follows:
where:
where s is a user-defined keyword; wf(t) is the number of microblogs containing the word t; t_M is the word contained in the largest number of microblogs in the domain-related corpus; wf(t ∧ s) is the number of microblogs containing both t and s; and N is the total number of the user's microblogs. The formula also uses the number of microblogs in which t_M appears and the number of microblogs that do not contain s. s_p is a smoothing factor with range (0, 1) that prevents a zero divisor when a count is zero. It follows from the formula that, before formula (1) is multiplied by the down-weighting factor, P-IOLogW ranges between log(s_p) and log(1/s_p). The larger the resulting P-IOW value, the higher the domain similarity between the word w and the user-defined keyword s. The user-defined keyword itself is assigned the upper bound of the P-IOLogW range, log(1/s_p).
(5) The present invention also supports multiple user-defined keywords. The keyword expansion module linearly sums the expansion word vectors obtained for each user-defined keyword to obtain the final expansion vector. The expansion procedure is as follows:
1) For each keyword in the user-defined keyword set, first calculate the weights of all its expansion words using P-IOW;
2) Map the expansion-word weight vector of that keyword onto the word space of the domain-related corpus to form an expansion word vector;
3) Linearly superimpose all expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.
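The three-step expansion procedure can be sketched as below. Since the P-IOW formula is not reproduced in this text, the scoring function is passed in as a parameter; `relevance(keyword, term)` is a hypothetical stand-in for the P-IOW score, and the lookup table in the usage lines is made-up illustration data rather than real P-IOW output.

```python
def expand_keywords(keywords, vocabulary, relevance):
    """Sum per-keyword expansion vectors over a shared vocabulary.

    keywords:   user-defined keywords
    vocabulary: ordered list of words in the domain-related corpus
    relevance:  callable (keyword, term) -> float, a stand-in for P-IOW
    """
    total = [0.0] * len(vocabulary)
    for keyword in keywords:
        # Steps 1 and 2: score every vocabulary word against this keyword,
        # which places the weights directly in the corpus word space.
        vector = [relevance(keyword, term) for term in vocabulary]
        # Step 3: linear superposition of the per-keyword vectors.
        total = [acc + value for acc, value in zip(total, vector)]
    return total


# Illustrative scores only; a real system would compute them with P-IOW.
toy_scores = {
    ("phone", "phone"): 1.0,
    ("phone", "camera"): 0.5,
    ("camera", "camera"): 1.0,
    ("camera", "battery"): 0.25,
}
vocab = ["phone", "camera", "battery"]
expansion = expand_keywords(
    ["phone", "camera"], vocab, lambda k, t: toy_scores.get((k, t), 0.0)
)
```

With no user-defined keywords the function returns an all-zero vector, which matches the empty result described for that case.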
Fig. 6 shows the main part of the module's UML class diagram. User-defined keywords are expanded using the co-occurrence information between other segmented words and the user-defined keywords, and the expansion result is represented as a vector so that it can be merged seamlessly with the keyword extraction result.
As shown in Fig. 2, the linear merging module is implemented as follows:
(1) Linear merging according to user-defined weights is the process of merging the two result vectors once both automatic keyword extraction and expansion based on user-defined keywords have completed.
The traditional approaches of modeling user interest through keyword extraction and of representing user interest through user-defined keywords each have their own shortcomings. For keyword extraction, the user cannot manually adjust the algorithm's result, so a temporary, dynamic need for a particular domain topic cannot be met; and the algorithm cannot model domain interest when the user has few historical microblogs, the so-called cold-start problem. For user-defined keywords, first, the keywords a user defines can hardly be guaranteed to cover all information in the domain; second, in short microblog text, judging similarity solely by whether a keyword appears means that much information that is synonymous but worded differently cannot be retrieved.
In view of the shortcomings of these two approaches, the present invention proposes combining automatic domain keyword extraction with user-defined keywords, and expanding the user-defined keywords using word co-occurrence information. This approach provides comprehensive identification of domain information while allowing keywords to be adjusted dynamically at any time as the user's interest changes; user-defined keywords can also alleviate the cold-start problem to some extent.
(2) Because the result vectors of keyword extraction and keyword expansion do not share the same value range, the two vectors must be mapped into the same range. This module normalizes both result vectors using maximum normalization, mapping them into a unified value range. The two normalized vectors are then linearly merged; the merge supports user-defined weights for keyword extraction and keyword expansion. The module outputs a word vector that represents the user's final domain interest.
For each component of a vector, the normalization is:
v_normal = v / v_max
where v is the initial value of a component of the vector and v_max is the maximum over all components of the vector. After maximum normalization, all components of the vector lie in (0, 1] and all are non-zero. The vectors are then merged by linear weighting:
V_combine = r × V_kw-extract + (1 - r) × V_kw-expand
where r is the user-defined merging weight, V_kw-extract is the result vector of keyword extraction, V_kw-expand is the result vector generated by keyword expansion, and the merged result V_combine is the keyword vector characterizing the user's domain interest.
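A minimal sketch of maximum normalization followed by the weighted merge is shown below. The function names are hypothetical, both vectors are assumed to be indexed over the same vocabulary, and the handling of an all-zero vector (for example, an empty expansion result) is an added assumption that the text does not specify.

```python
def max_normalize(vec):
    """Divide every component by the largest component: v_normal = v / v_max."""
    peak = max(vec, default=0.0)
    if peak <= 0:
        # Assumption: an all-zero (or empty) vector is returned unchanged.
        return [0.0] * len(vec)
    return [v / peak for v in vec]


def merge_interest(extract_vec, expand_vec, r):
    """V_combine = r * V_kw-extract + (1 - r) * V_kw-expand, after normalization."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    if len(extract_vec) != len(expand_vec):
        raise ValueError("vectors must share the same vocabulary")
    a = max_normalize(extract_vec)
    b = max_normalize(expand_vec)
    return [r * x + (1 - r) * y for x, y in zip(a, b)]


combined = merge_interest([2.0, 4.0], [1.0, 0.5], r=0.5)
```

Setting r closer to 1 favors the automatically extracted keywords, while setting it closer to 0 favors the expanded user-defined keywords.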
As shown in Fig. 2, the similarity calculation and personalized recommendation module is implemented as follows:
(1) After the keyword vector V_combine characterizing the user's domain interest has been generated, the relevance comparison module performs word segmentation and term-frequency counting on each microblog post to be filtered to generate a term-frequency vector. A dot-product operation over the user-interest keyword vector, the term-frequency vector of the candidate post, and the IDF vector then gives the relevance between the post and the user's interest. The relevance is calculated as follows:
where V_combine is the keyword vector of the user's domain interest; W is the term-frequency vector of the candidate microblog T; V_IDF is the IDF vector of the segmented words; L is the dimension of the vector space obtained by segmenting the user's historical microblogs; the formula uses the components of V_combine and W that correspond to the word t_i; and IDF(t_i) is the IDF value of t_i.
(2) Finally, the model calculates the domain relevance of each of the user's microblogs, and the system recommends microblogs based on it. The form of recommendation is not restricted: posts can be sorted from high to low domain relevance, or posts whose domain relevance exceeds a specified threshold can be selected, among other options. The present invention uses the first form of presentation.
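The relevance calculation and ranking can be sketched as follows. The formula is not reproduced in this text; the code assumes the reading suggested by the description, namely relevance(T) = Σ_i V_combine(t_i) × W(t_i) × IDF(t_i) summed over the words t_i. The dictionary-based sparse vectors and the function names are illustrative assumptions.

```python
from collections import Counter


def relevance(interest, words, idf):
    """Sum of interest weight * term frequency * IDF over the words of one post.

    interest: {word: weight} from the merged interest vector
    words:    segmented, noun-filtered words of the candidate post
    idf:      {word: IDF value}
    """
    term_freq = Counter(words)
    return sum(
        interest.get(word, 0.0) * count * idf.get(word, 0.0)
        for word, count in term_freq.items()
    )


def rank_posts(interest, posts, idf):
    """Return post indices sorted from highest to lowest domain relevance."""
    scored = [(relevance(interest, words, idf), i) for i, words in enumerate(posts)]
    return [i for _, i in sorted(scored, key=lambda pair: -pair[0])]


interest = {"phone": 1.0, "camera": 0.5}
idf = {"phone": 2.0, "camera": 4.0, "weather": 1.0}
posts = [["weather"], ["phone", "phone", "camera"], ["camera"]]
order = rank_posts(interest, posts, idf)
```

The threshold-based alternative mentioned in step (2) would replace the sort with a filter on the score.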
As shown in Fig. 2, the topic acquisition module is implemented as follows:
(1) The recommended domain microblogs are preprocessed as text. The preprocessing is the same as in module 1 and comprises stop-word filtering, word segmentation, and part-of-speech tagging.
(2) An LDA model is trained with the processed data as input, and terms are clustered into topics according to each topic's term distribution;
(3) The relevance between each topic's term set and the user's domain interest keywords obtained from the linear merging module is calculated. Relevance is measured using the co-occurrence information between terms: given the user's domain interest keywords, every topic term set that contains such a keyword is regarded as relevant, and a topic containing no domain interest keyword is regarded as having low or no relevance. The calculation uses the relevance calculation method of module 5.
(4) The relevance represents the importance of a topic; topics are sorted by importance and presented to the user.
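Steps (3) and (4) can be sketched as below, assuming the LDA model has already been trained elsewhere and each topic is available as a mapping from term to probability. The scoring reuses the dot-product idea of module 5; treating a missing IDF value as 1.0, the function names, and the sample topics are assumptions made for this illustration.

```python
def topic_importance(topic_terms, interest, idf=None):
    """Score one topic against the user's domain interest vector.

    topic_terms: {term: probability} for one topic (assumed LDA output)
    interest:    {word: weight} from the linear merging module
    idf:         optional {word: IDF value}; missing values default to 1.0
    """
    idf = idf or {}
    return sum(
        interest.get(term, 0.0) * prob * idf.get(term, 1.0)
        for term, prob in topic_terms.items()
    )


def rank_topics(topics, interest, idf=None):
    """Return topic names sorted from most to least important."""
    return sorted(
        topics,
        key=lambda name: topic_importance(topics[name], interest, idf),
        reverse=True,
    )


topics = {
    "topic_0": {"weather": 0.6, "rain": 0.4},
    "topic_1": {"phone": 0.7, "camera": 0.3},
}
interest = {"phone": 1.0, "camera": 0.5}
ranked = rank_topics(topics, interest)
```

A topic that shares no term with the interest vector scores zero, which corresponds to the "low or no relevance" case in step (3).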
The present invention implements a prototype system based on the Sina Weibo open platform to verify the results. The sequence diagram of the system is shown in Fig. 3, which illustrates the overall usage flow of the present invention:
1) After the user logs in, the system first verifies the user name and password and then calls the OAuth verification module of Sina Weibo to authorize the microblog account. If the user has not bound a microblog account, the account can be bound using OAuth.
2) After login and authorization are complete, the system updates the user's data and persists the user's microblog information locally, to ensure the timeliness of subsequent analysis.
3) After the data has been persisted locally, the segmenter and part-of-speech filter perform word segmentation and part-of-speech filtering on the user's domain-related microblogs in the database, in preparation for keyword extraction and keyword expansion. Here, the user's domain-related microblogs are mainly the posts published by the bound enterprise microblog account or by domain-related users specified by the user; they may even be an external domain-related corpus. The present invention does not restrict the choice of segmenter or part-of-speech tagger; in principle any Chinese segmenter and part-of-speech tagger can be used. The system corresponding to the present invention uses the ICTCLAS5.0 segmentation and part-of-speech tagging system developed by the Institute of Computing Technology of the Chinese Academy of Sciences and, because that segmenter's dictionary is dated, also imports external dictionary data from the Jieba segmenter. The preprocessing result is the user's historical microblog data.
4) The domain-related microblogs, after segmentation and part-of-speech filtering, are passed to the keyword extraction module in the form of word frequencies. The system extracts keywords using the TextRank for Weibo method; the extraction result represents the interest characteristics of the domain.
5) If the user has defined interest keywords, the system uses the P-IOW keyword expansion method to expand them based on domain information, in order to improve the recall of the user-defined keywords. The expansion result is expressed as a word vector. If the user has not defined any interest keywords, this step outputs an empty result.
6) After keyword extraction based on TextRank for Weibo and keyword expansion based on P-IOW have completed, their results are merged by linear weighting based on user-defined weights. The user can define the proportion contributed by the keyword extraction result and by the keyword expansion result. The output is the merged word vector, which represents the user's final domain interest.
7) This step calculates the similarity between the user's interest and each candidate microblog. The candidate microblog is first segmented and filtered by part of speech (as in step 3) and converted into a term-frequency vector. The dot product of the user's domain interest vector, the vector of the converted candidate microblog, and the IDF values of the words is then calculated, as discussed in detail above. The product is the similarity between the candidate microblog and the user's domain interest.
8) With the recommended domain microblog text as input, an LDA model is used to cluster terms into topic term sets. The importance of each topic term set is then calculated, and topics are presented to the user in order of importance.
The architecture of the system that implements the present invention is shown in Fig. 4. The system is built on a MySQL + JSP + Servlet + Twitter Bootstrap technology stack and is divided into six parts: the data acquisition and preprocessing module, the domain keyword extraction module, the user-defined keyword expansion module, the linear merging module, the similarity calculation and personalized recommendation module, and the topic acquisition module. The target users of the present invention are enterprise microblog users on the Sina Weibo platform. By analyzing the historical microblog text of the domain-related microblog accounts that a user specifies, the system can automatically identify the characteristics of the user's domain. The system also lets users define keywords dynamically to express their current domain needs, and expands those keywords according to their co-occurrence information in the historical microblogs, strengthening their semantic expressiveness. Users can then use keyword weights that they define themselves to complete a personalized model of their domain interest. Finally, based on the domain interest, the system filters and recommends the microblogs the user subscribes to and recommends topics. This spares users from checking large numbers of irrelevant microblogs one by one and improves the efficiency of enterprise marketing staff.
Claims (8)
1. A domain information recommendation system based on a microblog platform, characterized by comprising: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module; wherein:
Data acquisition and preprocessing module: obtains microblog information data related to the user and preprocesses it; preprocessing includes stop-word filtering, word segmentation, and part-of-speech tagging; the preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module;
Domain keyword extraction module: based on the preprocessing result, performs unsupervised keyword extraction using the TextRank for Weibo algorithm, a modification of the TextRank algorithm; the algorithm comprises two phases, construction of an undirected graph based on co-occurrence relations and graph-based calculation of node weights; in the graph-construction phase, the segmented words that appear in the user's historical microblogs are first converted into corresponding nodes; when the edges between nodes are constructed, whether an edge exists between two nodes and the weight of that edge are determined by the number of times the two words co-occur in the same microblog post; the weight of an edge is the co-occurrence count of the words in the same post: if two words co-occur in one of the user's posts, the weight of the edge between their nodes is increased by 1, and the final weight of an edge is the number of co-occurrences of its two words in the microblogs; then, in the graph-based node weight calculation phase, the weight of each node is calculated iteratively until the change in node weights converges to a given threshold; after the iteration ends, the weight of each node is the importance of the segmented word it represents, and all of the user's segmented words are sorted by weight to obtain the keyword extraction result, thereby automatically identifying the characteristics of the user's domain;
User-defined keyword expansion module: calculates the similarity between keywords from their co-occurrence, their distribution, and the attribute information of the users they belong to, and takes words of high relevance as the expansion result of the target keyword; this module supports multiple user-defined keywords and, for each user-defined keyword, linearly sums the expansion word vectors obtained by keyword expansion to obtain the final expansion vector; the user-defined keyword expansion function ensures that the user's dynamic interest needs can be met in real time and greatly strengthens the expressive power of user-defined keywords;
Linear merging module: after both automatic domain keyword extraction and expansion based on user-defined keywords have completed, normalizes the two result vectors by maximum normalization so that the result vectors of keyword extraction and keyword expansion are mapped into a unified value range; the two normalized vectors are then linearly merged, and the merge supports user-defined weights for keyword extraction and keyword expansion; the module outputs a word vector representing the user's final domain interest;
Relevance calculation and personalized recommendation module: after the linear merging module has produced the keyword vector characterizing the user's domain interest, performs word segmentation and term-frequency counting on each microblog post to be filtered to generate a term-frequency vector, then applies a dot-product operation to the user-interest keyword vector, the term-frequency vector of the candidate post, and the IDF vector to obtain the relevance between the post and the user's interest; this relevance is the domain relevance of the post; the domain relevance of each of the user's microblogs is calculated, the posts are sorted from high to low domain relevance, and the microblog information is presented to the user, achieving personalized domain microblog recommendation for the user;
Topic acquisition module: trains an LDA model with the domain microblog text recommended to the user as input and clusters terms into topics according to each topic's term distribution; calculates the relevance between each topic's term set and the user's domain interest keywords obtained from the linear merging module to obtain the topic's importance; and presents topics to the user in order of importance, thereby completing topic discovery and recommendation.
2. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that the data acquisition and preprocessing module is implemented as follows:
(1) after the user logs in to the system, the user is first authenticated; once authentication succeeds, the microblog platform credential associated with the user is automatically used to interact with the microblog platform to verify that the user's identity is valid on the microblog platform;
(2) the microblog text that the user follows or subscribes to is obtained, and the obtained data is persisted in structured form in a local database so that it can be read at any time;
(3) the persisted microblog text is preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging; based on the characteristics of microblog text, stop words are first filtered out by pattern matching, then Chinese word segmentation and part-of-speech tagging optimized for the microblog setting are performed using the ICTCLAS5.0 segmenter; before keyword extraction and keyword expansion, the segmented user microblogs are filtered by part of speech and only nouns are retained. The preprocessing result is the user's historical microblog data.
3. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the domain keyword extraction module, the weight of each node is calculated iteratively from the historical microblog data following the idea of the PageRank algorithm, using the following formula:
where V_i is the i-th node; TR(V_i) is the weight of node V_i; w_ij is the weight of the edge between nodes V_i and V_j; E(V_i) is the set of edges connected to V_i; and d is the damping coefficient of the iteration, set to 0.85; iteration can start from arbitrary initial values and continues until convergence, which is reached when the sum of the absolute differences of the node weights between the current and the previous iteration is less than a specified value.
4. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the user-defined keyword expansion module, the similarity between keywords is calculated using the improved P-IOW algorithm, implemented as follows:
for a given user-defined keyword s, the domain relevance of a word t with respect to s is calculated as follows:
where:
where s is a user-defined keyword; wf(t) is the number of microblogs containing the word t; t_M is the word contained in the largest number of microblogs in the domain-related corpus; wf(t ∧ s) is the number of microblogs containing both t and s; and N is the total number of the user's microblogs; the formula also uses the number of microblogs in which t_M appears and the number of microblogs that do not contain s; s_p is a smoothing factor with range (0, 1) that prevents a zero divisor when a count is zero; it follows from the formula that, before formula (1) is multiplied by the down-weighting factor, P-IOLogW ranges between log(s_p) and log(1/s_p); the larger the resulting P-IOW value, the higher the domain similarity between the word w and the user-defined keyword s; the user-defined keyword itself is assigned the upper bound of the P-IOLogW range, log(1/s_p).
5. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the user-defined keyword expansion module, when the expansion word vectors obtained by keyword expansion are linearly summed to obtain the final expansion vector, the expansion procedure is as follows:
(1) for each keyword in the user-defined keyword set, first calculate the weights of all its expansion words using P-IOW;
(2) map the expansion-word weight vector of that keyword onto the word space of the domain-related corpus to form an expansion word vector;
(3) linearly superimpose all expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.
6. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that the linear merging module is implemented as follows:
(1) the vectors generated by keyword extraction and keyword expansion are each normalized using maximum normalization;
(2) for each component of a vector, the normalization is:
v_normal = v / v_max
where v is the initial value of a component of the vector and v_max is the maximum over all components of the vector; after maximum normalization, all components of the vector lie in (0, 1] and all are non-zero; the vectors are then merged by linear weighting:
V_combine = r × V_kw-extract + (1 - r) × V_kw-expand
where r is the user-defined merging weight, V_kw-extract is the result vector of keyword extraction, V_kw-expand is the result vector generated by keyword expansion, and the merged result V_combine is the keyword vector characterizing the user's domain interest.
7. The microblog-based domain information recommendation system according to claim 1, characterized in that: in the similarity calculation and personalized recommendation module, the domain relevance is calculated by the following formula:
where: $V_{combine}$ is the keyword vector of the user's domain interest; $W$ is the word-frequency vector of the microblog $T$ to be recommended; $V_{IDF}$ is the IDF vector of the word segments; $L$ is the total dimension of the vector space obtained by segmenting the user's historical microblogs; the two per-term factors are the components of $V_{combine}$ and $W$ corresponding to $t_i$, respectively; and $IDF(t_i)$ is the IDF value of $t_i$.
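The formula referenced in claim 7 is not reproduced in the text above. Based on the variable definitions given there, and on the description in step (5) of claim 8 of a dot product over the interest keyword vector, the word-frequency vector, and the IDF vector, one plausible reading is a sum over $i = 1, \dots, L$ of the product of the $t_i$ components of $V_{combine}$, $W$, and $V_{IDF}$. The sketch below implements that assumed form only; the function name `domain_relevance` is hypothetical.

```python
def domain_relevance(v_combine, w, v_idf):
    """Assumed relevance score: sum_i V_combine[i] * W[i] * IDF[i].

    All three lists are indexed by the same word-segment space of size L.
    This three-way product is an assumption inferred from the claim text,
    not a formula quoted from the patent.
    """
    if not (len(v_combine) == len(w) == len(v_idf)):
        raise ValueError("all three vectors must share the same dimension L")
    return sum(c * f * idf for c, f, idf in zip(v_combine, w, v_idf))
```

For instance, `domain_relevance([1.0, 0.5, 0.0], [2, 1, 3], [1.0, 2.0, 4.0])` evaluates to `3.0`: the third term has word frequency 3 but contributes nothing, because its interest weight is zero.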
8. A microblog-based domain information recommendation method, characterized in that it comprises six steps: data acquisition and preprocessing, domain keyword extraction, user-defined keyword expansion, linear merging, similarity calculation and personalized recommendation, and topic acquisition, implemented as follows:
(1) obtain the user's relevant microblog information and preprocess the data; preprocessing includes filtering stop words by pattern matching, and performing word segmentation and part-of-speech tagging with the ICTCLAS5.0 word segmentation system; the preprocessing result is the user's historical microblog data;
(2) perform domain keyword extraction on the preprocessing result; keyword extraction is carried out without supervision using the TextRank for Weibo algorithm, which the present invention obtains by modifying the TextRank algorithm; the TextRank for Weibo algorithm is divided into two stages: construction of an undirected graph based on co-occurrence relations, and graph-based calculation of node weights; the graph construction stage first converts each word segment appearing in the user's historical microblogs into a corresponding node; then, within the segmentation result of each microblog, an edge is constructed for every pair of word segments that appear together, the weight of the edge being equal to the number of times the pair co-occurs in the historical microblogs; the graph-based node weight calculation stage iteratively computes the node weights following the idea of the PageRank algorithm, until the change in node weights converges below a certain threshold;
(3) if the user has defined domain interest keywords, the preprocessing result is additionally passed to the keyword expansion module; keyword expansion is carried out using the P-IOW algorithm, an improvement on the P-IOLog algorithm, which calculates the similarity between keywords from information such as keyword co-occurrence, keyword distribution, and the attributes of the users to whom the keywords belong, and takes the words with high relevance as the expansion result of the target keyword; multiple user-entered custom keywords are supported at the same time; for each custom keyword, the keyword expansion module linearly sums the expansion word vectors expanded from it, thereby obtaining the final expansion vector;
(4) the two result vectors are normalized using the maximum normalization method, which maps them into a unified value range; after normalization, the two normalized vectors are linearly merged, and the merging process supports user-defined weights for keyword extraction and keyword expansion; the merged result is used by the relevance recommendation module for relevance comparison, so as to generate relevance scores for the microblogs to be recommended;
(5) the microblogs to be recommended, to which the user subscribes, are segmented and vectorized by word frequency; then, according to the identified user interest, a dot product operation is performed on the user interest keyword vector, the word-frequency vector generated from the microblog to be recommended, and the IDF information vector, so that the domain relevance is calculated by the vector-space dot product method;
(6) the set of domain microblog texts recommended to the user is taken as input, and term clustering is performed based on an LDA strategy to complete topic discovery; the relevance between each topic term set and the user's domain interest keyword terms obtained in the linear merging module is then calculated to establish topic importance, and recommendations are made to the user according to topic importance.
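The two stages described in step (2) of claim 8, building a co-occurrence graph and then iterating PageRank-style node weights, can be sketched as below. This sketch uses the standard weighted TextRank update, not the modified "TextRank for Weibo" algorithm, whose specific changes are not given in this claim. The function names, the damping factor 0.85, and the tolerance are assumptions, and the input is assumed to be already segmented with stop words filtered out.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(segmented_posts):
    """Undirected graph: nodes are word segments, edge weight = number of
    posts in which the two segments appear together."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in segmented_posts:
        # Each unordered pair of distinct segments in one post co-occurs once.
        for a, b in combinations(sorted(set(tokens)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def textrank(graph, damping=0.85, tol=1e-6, max_iter=200):
    """Iterate weighted TextRank scores until the largest change is below tol."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    # Total edge weight leaving each node, used to normalize contributions.
    out_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            rank = sum(
                graph[nbr][node] / out_weight[nbr] * scores[nbr]
                for nbr in graph[node]
            )
            new_scores[node] = (1 - damping) + damping * rank
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:  # convergence threshold on node-weight change
            break
    return scores
```

With the posts `[["phone", "camera"], ["phone", "battery"], ["phone", "camera", "battery"]]`, the word "phone" co-occurs most often with the other words and therefore receives the highest score. Words that never co-occur with any other word do not become nodes in this sketch.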
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611075431.XA CN106776881A (en) | 2016-11-28 | 2016-11-28 | A kind of realm information commending system and method based on microblog |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611075431.XA CN106776881A (en) | 2016-11-28 | 2016-11-28 | A kind of realm information commending system and method based on microblog |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106776881A (en) | 2017-05-31 |
Family
ID=58900759
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611075431.XA Pending CN106776881A (en) | 2016-11-28 | 2016-11-28 | A kind of realm information commending system and method based on microblog |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106776881A (en) |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229871A (en) * | 2017-07-17 | 2017-10-03 | 梧州井儿铺贸易有限公司 | A kind of safe information acquisition device |
CN107370664A (en) * | 2017-07-17 | 2017-11-21 | 陈剑桃 | A kind of effective microblogging junk user finds system |
CN107436934A (en) * | 2017-07-21 | 2017-12-05 | 上海斐讯数据通信技术有限公司 | It is a kind of to orient the system and method for subscribing to the story of a play or opera |
CN107704512A (en) * | 2017-08-31 | 2018-02-16 | 平安科技(深圳)有限公司 | Financial product based on social data recommends method, electronic installation and medium |
CN107766482A (en) * | 2017-10-13 | 2018-03-06 | 北京猎户星空科技有限公司 | Information pushes and sending method, device, electronic equipment, storage medium |
CN108255957A (en) * | 2017-12-21 | 2018-07-06 | 杭州传送门网络科技有限公司 | One kind recommends matching process based on Venture Capital field precision dataization |
CN108287916A (en) * | 2018-02-11 | 2018-07-17 | 北京方正阿帕比技术有限公司 | A kind of resource recommendation method |
CN108304371A (en) * | 2017-07-14 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Method, apparatus, computer equipment and the storage medium that Hot Contents excavate |
CN108319677A (en) * | 2018-01-30 | 2018-07-24 | 中南大学 | The alignment schemes of the cyberrelationship figure of dynamic change |
CN108388597A (en) * | 2018-02-01 | 2018-08-10 | 深圳市鹰硕技术有限公司 | Conference summary generation method and device |
CN108763205A (en) * | 2018-05-21 | 2018-11-06 | 阿里巴巴集团控股有限公司 | A kind of brand alias recognition methods, device and electronic equipment |
CN108846023A (en) * | 2018-05-24 | 2018-11-20 | 普强信息技术(北京)有限公司 | The unconventional characteristic method for digging and device of text |
CN108932318A (en) * | 2018-06-26 | 2018-12-04 | 四川政资汇智能科技有限公司 | A kind of intellectual analysis and accurate method for pushing based on Policy resources big data |
CN109034389A (en) * | 2018-08-02 | 2018-12-18 | 黄晓鸣 | Man-machine interactive modification method, device, equipment and the medium of information recommendation system |
CN109241238A (en) * | 2018-06-27 | 2019-01-18 | 广州优视网络科技有限公司 | Article search method, apparatus and electronic equipment |
CN109255126A (en) * | 2018-09-10 | 2019-01-22 | 百度在线网络技术(北京)有限公司 | Article recommended method and device |
CN109299280A (en) * | 2018-12-12 | 2019-02-01 | 河北工程大学 | Short text clustering analysis method, device and terminal device |
CN109376309A (en) * | 2018-12-28 | 2019-02-22 | 北京百度网讯科技有限公司 | Document recommendation method and device based on semantic label |
CN109635081A (en) * | 2018-11-23 | 2019-04-16 | 上海大学 | A kind of text key word weighing computation method based on word frequency power-law distribution characteristic |
CN109685085A (en) * | 2017-10-18 | 2019-04-26 | 阿里巴巴集团控股有限公司 | A kind of master map extracting method and device |
CN110019702A (en) * | 2017-09-18 | 2019-07-16 | 阿里巴巴集团控股有限公司 | Data digging method, device and equipment |
CN110110207A (en) * | 2018-01-18 | 2019-08-09 | 北京搜狗科技发展有限公司 | A kind of information recommendation method, device and electronic equipment |
CN110222160A (en) * | 2019-05-06 | 2019-09-10 | 平安科技(深圳)有限公司 | Intelligent semantic document recommendation method, device and computer readable storage medium |
CN110427547A (en) * | 2018-04-26 | 2019-11-08 | 观相科技(上海)有限公司 | A kind of search system and searching method based on industrial characteristic |
CN110427480A (en) * | 2019-06-28 | 2019-11-08 | 平安科技(深圳)有限公司 | Personalized text intelligent recommendation method, apparatus and computer readable storage medium |
CN110489665A (en) * | 2019-08-16 | 2019-11-22 | 北京信息科技大学 | A kind of microblogging personalized recommendation method based on scene modeling and convolutional neural networks |
CN110633408A (en) * | 2018-06-20 | 2019-12-31 | 北京正和岛信息科技有限公司 | Recommendation method and system for intelligent business information |
CN111831802A (en) * | 2020-06-04 | 2020-10-27 | 北京航空航天大学 | Urban domain knowledge detection system and method based on LDA topic model |
CN112364947A (en) * | 2021-01-14 | 2021-02-12 | 北京崔玉涛儿童健康管理中心有限公司 | Text similarity calculation method and device |
CN112749284A (en) * | 2020-12-31 | 2021-05-04 | 平安科技(深圳)有限公司 | Knowledge graph construction method, device, equipment and storage medium |
CN112784142A (en) * | 2019-10-24 | 2021-05-11 | 北京搜狗科技发展有限公司 | Information recommendation method and device |
CN112861004A (en) * | 2021-02-20 | 2021-05-28 | 中国联合网络通信集团有限公司 | Rich media determination method and device |
CN113220994A (en) * | 2021-05-08 | 2021-08-06 | 中国科学院自动化研究所 | User personalized information recommendation method based on target object enhanced representation |
CN114048374A (en) * | 2021-10-28 | 2022-02-15 | 盐城金堤科技有限公司 | Method and device for determining object to be recommended |
CN116228282A (en) * | 2023-05-09 | 2023-06-06 | 湖南惟客科技集团有限公司 | Intelligent commodity distribution method for user data tendency |
CN116244496A (en) * | 2022-12-06 | 2023-06-09 | 山东紫菜云数字科技有限公司 | Resource recommendation method based on industrial chain |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831234A (en) * | 2012-08-31 | 2012-12-19 | 北京邮电大学 | Personalized news recommendation device and method based on news content and theme feature |
CN105677769A (en) * | 2015-12-29 | 2016-06-15 | 广州神马移动信息科技有限公司 | Keyword recommending method and system based on latent Dirichlet allocation (LDA) model |
-
2016
- 2016-11-28 CN CN201611075431.XA patent/CN106776881A/en active Pending
Non-Patent Citations (2)
Title |
---|
吴雨龙等: "一种面向企业的行业微博信息推荐方法", 《计算机应用与软件》 * |
唐晓波等: "基于文本聚类与LDA相融合的微博主题检索模型研究", 《情报理论与实践》 * |
Cited By (58)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304371A (en) * | 2017-07-14 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Method, apparatus, computer equipment and the storage medium that Hot Contents excavate |
CN108304371B (en) * | 2017-07-14 | 2021-07-13 | 腾讯科技(深圳)有限公司 | Method and device for mining hot content, computer equipment and storage medium |
CN107370664A (en) * | 2017-07-17 | 2017-11-21 | 陈剑桃 | A kind of effective microblogging junk user finds system |
CN107229871A (en) * | 2017-07-17 | 2017-10-03 | 梧州井儿铺贸易有限公司 | A kind of safe information acquisition device |
CN107436934A (en) * | 2017-07-21 | 2017-12-05 | 上海斐讯数据通信技术有限公司 | It is a kind of to orient the system and method for subscribing to the story of a play or opera |
CN107436934B (en) * | 2017-07-21 | 2023-09-08 | 杭州吉吉知识产权运营有限公司 | System and method for directionally subscribing to scenario |
CN107704512A (en) * | 2017-08-31 | 2018-02-16 | 平安科技(深圳)有限公司 | Financial product based on social data recommends method, electronic installation and medium |
CN107704512B (en) * | 2017-08-31 | 2021-08-24 | 平安科技(深圳)有限公司 | Financial product recommendation method based on social data, electronic device and medium |
CN110019702B (en) * | 2017-09-18 | 2023-04-07 | 阿里巴巴集团控股有限公司 | Data mining method, device and equipment |
CN110019702A (en) * | 2017-09-18 | 2019-07-16 | 阿里巴巴集团控股有限公司 | Data digging method, device and equipment |
CN107766482A (en) * | 2017-10-13 | 2018-03-06 | 北京猎户星空科技有限公司 | Information pushes and sending method, device, electronic equipment, storage medium |
CN109685085B (en) * | 2017-10-18 | 2023-09-26 | 阿里巴巴集团控股有限公司 | Main graph extraction method and device |
CN109685085A (en) * | 2017-10-18 | 2019-04-26 | 阿里巴巴集团控股有限公司 | A kind of master map extracting method and device |
CN108255957A (en) * | 2017-12-21 | 2018-07-06 | 杭州传送门网络科技有限公司 | One kind recommends matching process based on Venture Capital field precision dataization |
CN110110207B (en) * | 2018-01-18 | 2023-11-03 | 北京搜狗科技发展有限公司 | Information recommendation method and device and electronic equipment |
CN110110207A (en) * | 2018-01-18 | 2019-08-09 | 北京搜狗科技发展有限公司 | A kind of information recommendation method, device and electronic equipment |
CN108319677A (en) * | 2018-01-30 | 2018-07-24 | 中南大学 | The alignment schemes of the cyberrelationship figure of dynamic change |
CN108388597A (en) * | 2018-02-01 | 2018-08-10 | 深圳市鹰硕技术有限公司 | Conference summary generation method and device |
CN108287916A (en) * | 2018-02-11 | 2018-07-17 | 北京方正阿帕比技术有限公司 | A kind of resource recommendation method |
CN110427547A (en) * | 2018-04-26 | 2019-11-08 | 观相科技(上海)有限公司 | A kind of search system and searching method based on industrial characteristic |
CN108763205B (en) * | 2018-05-21 | 2022-05-03 | 创新先进技术有限公司 | Brand alias identification method and device and electronic equipment |
CN108763205A (en) * | 2018-05-21 | 2018-11-06 | 阿里巴巴集团控股有限公司 | A kind of brand alias recognition methods, device and electronic equipment |
CN108846023A (en) * | 2018-05-24 | 2018-11-20 | 普强信息技术(北京)有限公司 | The unconventional characteristic method for digging and device of text |
CN110633408B (en) * | 2018-06-20 | 2024-03-15 | 北京正和岛信息科技有限公司 | Intelligent business information recommendation method and system |
CN110633408A (en) * | 2018-06-20 | 2019-12-31 | 北京正和岛信息科技有限公司 | Recommendation method and system for intelligent business information |
CN108932318A (en) * | 2018-06-26 | 2018-12-04 | 四川政资汇智能科技有限公司 | A kind of intellectual analysis and accurate method for pushing based on Policy resources big data |
CN108932318B (en) * | 2018-06-26 | 2022-03-04 | 四川政资汇智能科技有限公司 | Intelligent analysis and accurate pushing method based on policy resource big data |
CN109241238A (en) * | 2018-06-27 | 2019-01-18 | 广州优视网络科技有限公司 | Article search method, apparatus and electronic equipment |
CN109034389A (en) * | 2018-08-02 | 2018-12-18 | 黄晓鸣 | Man-machine interactive modification method, device, equipment and the medium of information recommendation system |
CN109255126A (en) * | 2018-09-10 | 2019-01-22 | 百度在线网络技术(北京)有限公司 | Article recommended method and device |
CN109635081B (en) * | 2018-11-23 | 2023-06-13 | 上海大学 | Text keyword weight calculation method based on word frequency power law distribution characteristics |
CN109635081A (en) * | 2018-11-23 | 2019-04-16 | 上海大学 | A kind of text key word weighing computation method based on word frequency power-law distribution characteristic |
CN109299280A (en) * | 2018-12-12 | 2019-02-01 | 河北工程大学 | Short text clustering analysis method, device and terminal device |
CN109376309B (en) * | 2018-12-28 | 2022-05-17 | 北京百度网讯科技有限公司 | Document recommendation method and device based on semantic tags |
CN109376309A (en) * | 2018-12-28 | 2019-02-22 | 北京百度网讯科技有限公司 | Document recommendation method and device based on semantic label |
US11216504B2 (en) | 2018-12-28 | 2022-01-04 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Document recommendation method and device based on semantic tag |
CN110222160A (en) * | 2019-05-06 | 2019-09-10 | 平安科技(深圳)有限公司 | Intelligent semantic document recommendation method, device and computer readable storage medium |
CN110222160B (en) * | 2019-05-06 | 2023-09-15 | 平安科技(深圳)有限公司 | Intelligent semantic document recommendation method and device and computer readable storage medium |
WO2020258481A1 (en) * | 2019-06-28 | 2020-12-30 | 平安科技(深圳)有限公司 | Method and apparatus for intelligently recommending personalized text, and computer-readable storage medium |
CN110427480A (en) * | 2019-06-28 | 2019-11-08 | 平安科技(深圳)有限公司 | Personalized text intelligent recommendation method, apparatus and computer readable storage medium |
CN110489665B (en) * | 2019-08-16 | 2023-11-14 | 北京信息科技大学 | Microblog personalized recommendation method based on scene modeling and convolutional neural network |
CN110489665A (en) * | 2019-08-16 | 2019-11-22 | 北京信息科技大学 | A kind of microblogging personalized recommendation method based on scene modeling and convolutional neural networks |
CN112784142A (en) * | 2019-10-24 | 2021-05-11 | 北京搜狗科技发展有限公司 | Information recommendation method and device |
CN111831802B (en) * | 2020-06-04 | 2023-05-26 | 北京航空航天大学 | Urban domain knowledge detection system and method based on LDA topic model |
CN111831802A (en) * | 2020-06-04 | 2020-10-27 | 北京航空航天大学 | Urban domain knowledge detection system and method based on LDA topic model |
CN112749284B (en) * | 2020-12-31 | 2021-12-17 | 平安科技(深圳)有限公司 | Knowledge graph construction method, device, equipment and storage medium |
CN112749284A (en) * | 2020-12-31 | 2021-05-04 | 平安科技(深圳)有限公司 | Knowledge graph construction method, device, equipment and storage medium |
CN112364947B (en) * | 2021-01-14 | 2021-06-29 | 北京育学园健康管理中心有限公司 | Text similarity calculation method and device |
CN112364947A (en) * | 2021-01-14 | 2021-02-12 | 北京崔玉涛儿童健康管理中心有限公司 | Text similarity calculation method and device |
CN112861004A (en) * | 2021-02-20 | 2021-05-28 | 中国联合网络通信集团有限公司 | Rich media determination method and device |
CN112861004B (en) * | 2021-02-20 | 2024-02-06 | 中国联合网络通信集团有限公司 | Method and device for determining rich media |
CN113220994A (en) * | 2021-05-08 | 2021-08-06 | 中国科学院自动化研究所 | User personalized information recommendation method based on target object enhanced representation |
CN113220994B (en) * | 2021-05-08 | 2022-10-28 | 中国科学院自动化研究所 | User personalized information recommendation method based on target object enhanced representation |
CN114048374A (en) * | 2021-10-28 | 2022-02-15 | 盐城金堤科技有限公司 | Method and device for determining object to be recommended |
CN116244496A (en) * | 2022-12-06 | 2023-06-09 | 山东紫菜云数字科技有限公司 | Resource recommendation method based on industrial chain |
CN116244496B (en) * | 2022-12-06 | 2023-12-01 | 山东紫菜云数字科技有限公司 | Resource recommendation method based on industrial chain |
CN116228282A (en) * | 2023-05-09 | 2023-06-06 | 湖南惟客科技集团有限公司 | Intelligent commodity distribution method for user data tendency |
CN116228282B (en) * | 2023-05-09 | 2023-08-11 | 湖南惟客科技集团有限公司 | Intelligent commodity distribution method for user data tendency |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106776881A (en) | A kind of realm information commending system and method based on microblog | |
Zhao et al. | Connecting social media to e-commerce: Cold-start product recommendation using microblogging information | |
CN104899273B (en) | A kind of Web Personalization method based on topic and relative entropy | |
US7519588B2 (en) | Keyword characterization and application | |
CN101420313B (en) | Method and system for clustering customer terminal user group | |
CN103324665B (en) | Hot spot information extraction method and device based on micro-blog | |
Ye et al. | Web services classification based on wide & Bi-LSTM model | |
CN105095433B (en) | Entity recommended method and device | |
CN111552799B (en) | Information processing method, information processing device, electronic equipment and storage medium | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
CN102043843A (en) | Method and obtaining device for obtaining target entry based on target application | |
CN105740448B (en) | More microblogging timing abstract methods towards topic | |
CN111552797B (en) | Name prediction model training method and device, electronic equipment and storage medium | |
Lytvyn et al. | Textual Content Categorizing Technology Development Based on Ontology. | |
CN110472043A (en) | A kind of clustering method and device for comment text | |
CN106202065A (en) | A kind of across language topic detecting method and system | |
Kwapong et al. | A knowledge graph based framework for web API recommendation | |
Khan et al. | Collaborative filtering based online recommendation systems: A survey | |
Hu et al. | A Web service clustering method based on topic enhanced Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model and service collaboration graph | |
Jinarat et al. | Short text clustering based on word semantic graph with word embedding model | |
CN103095849A (en) | A method and a system of spervised web service finding based on attribution forecast and error correction of quality of service (QoS) | |
Chen et al. | Learning the structures of online asynchronous conversations | |
CN105426382A (en) | Music recommendation method based on emotional context awareness of Personal Rank | |
KR20180113444A (en) | Method, apparauts and system for named entity linking and computer program thereof | |
Abulaish et al. | A layered approach for summarization and context learning from microblogging data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170531 |