CN106776881A

CN106776881A - Domain information recommendation system and method based on a microblog platform

Info

Publication number: CN106776881A
Application number: CN201611075431.XA
Authority: CN
Inventors: 杨燕; 王帅; 徐良; 徐罡; 田申
Original assignee: Institute of Software of CAS
Current assignee: Institute of Software of CAS
Priority date: 2016-11-28
Filing date: 2016-11-28
Publication date: 2017-05-31

Abstract

The invention discloses a domain information recommendation system and method based on a microblog platform. The system comprises a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module. The invention designs and implements a domain information recommendation method tailored to the characteristics of microblog platforms. It combines keyword extraction and keyword expansion seamlessly, which both ensures that domain features are extracted and keeps the recommendation results dynamic. Experiments on Sina Weibo with the corresponding system demonstrate the effectiveness of the method. The invention can assist enterprise microblog marketing and effectively improve its efficiency.

Description

Domain information recommendation system and method based on a microblog platform

Technical field

The present invention relates to a domain microblog recommendation system and method for microblog platforms that supports unsupervised domain feature extraction and user-defined keywords, and belongs to the field of computer technology.

Background technology

As the Internet enters the post-Web 2.0 era, social functions have become the model for major changes on the Internet. Major social networking sites have sprung up like mushrooms after rain and quickly taken a dominant position on the Internet. As early as March 2010, the well-known American social networking site Facebook™ surpassed Google in visits to become the most visited website in the United States. In China, emerging social media such as Sina Weibo have also risen rapidly: as of May 16, 2012, Sina Weibo had reached 300 million users, second in the domestic Internet market only to Tencent QQ, a product that has accumulated users for more than ten years. In addition, in the report on the top ten strategic technologies for the IT industry in 2011 published by Gartner, the world's most authoritative IT research and advisory firm, two technologies are directly related to social networking: Social Communications and Collaboration, and Social Analytics. Social is also the only technology category to occupy two places in the list. This shows how much attention people pay to social products and how promising those products are.

Meanwhile, the spread of Internet access and the growing stickiness of Internet use have made big data a focus of the IT community in recent years. Turning big data into value that is useful to people, however, requires the support of data mining and related technologies, so interest in data mining and analysis has also soared. This is especially true of enterprise-oriented data mining and analysis, because it can bring direct benefits to enterprises. Analysis results based on real data generated by a large number of users are more reliable and convincing than those of traditional analysis techniques. Such analysis is also performed in real time, so it adapts better to market changes and captures market opportunities better.

Although microblog platforms contain information on all kinds of domains and react quickly to domain events, obtaining comprehensive domain information from them still faces many difficulties. The rise of microblog platforms and the rapid growth of their user base bring an information overload problem: as a user follows more accounts, more and more content unrelated to the domain appears among the microblogs the user subscribes to. Conversely, if the user follows only a small number of highly domain-relevant accounts, the user cannot obtain comprehensive domain information in time. In other words, when a user acquires data on a microblog platform, precision and recall cannot both be achieved. Microblog platforms are also characterized by scattered topics and fragmented information. This calls for a method that can identify the domain interests of enterprise microblog users and extract and recommend microblog information according to its domain relevance. However, for domain information extraction, most existing social media management and analysis software merely performs simple string matching against user-defined keywords, and this approach has serious defects. First, individual keywords cannot fully describe the demand for domain information. Second, the richness of language limits the effectiveness of simple string matching.

The traditional methods of modeling user interest through keyword extraction and of representing user interest through user-defined keywords each have their own defects.

The defects of keyword extraction are mainly as follows. It is an unsupervised extraction algorithm whose results the user cannot adjust dynamically, so it cannot meet a user's temporary, changing demand for a particular domain topic. In addition, the algorithm cannot model domain interest when the user has few historical microblogs, which is the so-called cold start problem.

The defects of user-defined keywords are mainly as follows. First, user-defined keywords can hardly be guaranteed to cover all information in the domain. Second, in short microblog texts, judging similarity merely by whether a keyword appears means that much information with the same meaning but different wording cannot be extracted.

Related research work also has its own problems when applied to this scenario. The rise of social networks provides data mining and data analysis technologies with valuable new application scenarios and gives data mining and analysis a new perspective, but it also confronts traditional data mining technologies with new challenges.

In summary, applying the above methods in a practical microblog platform system raises the following major problems:

(1) Keyword extraction and keyword expansion each have their own problems, and a single strategy cannot adequately meet the demand for recommending domain-related information on microblog platforms.

(2) In related research work, because information on social platforms is fragmented, one class of algorithms has to rely on external corpora for training. This greatly reduces the practicality and portability of the algorithm.

(3) Another class of algorithms builds its model on a global corpus. Such algorithms are computationally very expensive and are not generally applicable.

A recommendation method is therefore needed that provides personalized, keyword-based recommendation of domain information to help users obtain domain-related information quickly and accurately. The method must resolve the conflict between precision and recall that enterprise users face when obtaining domain-related information on microblog platforms. For the sake of practicality, the algorithm must also not depend on external corpora and must compute on the user's local data. Proposing a recommendation method with these properties is the focus of the present invention.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art and provide a domain information recommendation system and method based on a microblog platform. Based on the user's historical microblogs, the invention proposes a method that models user interest by combining keyword extraction with keyword expansion, which both ensures comprehensive identification of domain information and lets users dynamically adjust their own domain interest as needed. The graph-based keyword extraction algorithm TextRank is used; it does not depend on other corpora and prevents the extraction results from being affected by the Zipf's law phenomenon present in language models. An optimized P-IOW algorithm is proposed to achieve better keyword expansion. The method ensures that the user's dynamic interest demands can be met in real time and greatly enhances the expressive power of user-defined keywords. By linearly merging the results of keyword extraction and keyword expansion according to user-defined weights, the invention can recommend microblog information and topics of the relevant domain in a manner customized to the user, helping the user obtain domain-related information quickly and accurately.

Technical solution of the invention: a domain information recommendation system based on a microblog platform, comprising a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module, wherein:

Data acquisition and preprocessing module: obtains the microblog information data related to the user and preprocesses it. Preprocessing includes stop-word filtering, word segmentation, and part-of-speech tagging of the data. The preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain-interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module;

Domain keyword extraction module: based on the preprocessing result, keyword extraction is performed without supervision using the TextRank for Weibo algorithm, a modification of the TextRank algorithm. The algorithm has two stages: construction of an undirected graph based on co-occurrence relations, and graph-based calculation of node weights. In the graph construction stage, the segmented words that appear in the user's historical microblogs are first converted into corresponding nodes. When the edges between nodes are constructed, both the existence of an edge between two nodes and the weight of that edge are determined by how often the two words co-occur in the same microblog: each time two words co-occur in one of the user's microblogs, the weight of the edge between their nodes is increased by 1, so the final weight of an edge is the co-occurrence count of the two corresponding words in the microblogs. In the node weight calculation stage, the weights are computed iteratively until the change in node weights converges below a certain threshold. After the iteration ends, the weight of each node is the importance of the word it represents. Sorting all of the user's words by importance yields the keyword extraction result, which automatically identifies the features of the user's domain;

User-defined keyword expansion module: calculates the similarity between keywords based on information such as the co-occurrence and distribution of the keywords and the attributes of the users they belong to, and takes the words with high relevance as the expansion result of the target keyword. The module supports input of multiple user-defined keywords; the expansion word vectors obtained for the individual user-defined keywords are linearly summed to obtain the final expansion vector. User-defined keyword expansion ensures that the user's dynamic interest demands can be met in real time and greatly enhances the expressive power of user-defined keywords;

Linear merging module: after automatic extraction of domain keywords and expansion based on user-defined keywords are complete, the two result vectors are normalized using maximum-value normalization so that the keyword extraction and keyword expansion result vectors are mapped into a unified value range. The two normalized vectors are then linearly merged, and the merging process supports user-defined weights for keyword extraction and keyword expansion. The module outputs a word vector that represents the user's final domain interest;

Relevance calculation and personalized recommendation module: after the linear merging module has produced the keyword vector describing the user's domain interest, each microblog to be filtered is segmented and its word frequencies are counted to generate a word-frequency vector. The user-interest keyword vector, the word-frequency vector generated from the microblog to be recommended, and the IDF information vector are then combined by a dot product to obtain the relevance between the microblog and the user's interest; this relevance is the domain relevance of the microblog. The domain relevance of each of the user's microblogs is calculated, the microblogs are sorted by domain relevance from high to low, and the microblog information is presented to the user, realizing personalized domain microblog recommendation for the user;

Topic acquisition module: an LDA model is trained with the domain microblog texts recommended to the user as input, and terms are clustered into topics according to the term distribution of each topic. The relevance between each topic's term set and the user domain-interest keyword terms obtained in the linear merging module is calculated to obtain topic importance, and the topics are presented to the user in order of importance, completing topic discovery and recommendation.

The data acquisition and preprocessing module is implemented as follows:

(1) After the user logs in to the microblog system, the user is first verified. Once verification passes, the system automatically uses the microblog platform credential associated with the user to interact with the microblog platform and verify the legitimacy of the user's identity on that platform;

(2) The microblog information data related to the user is obtained and persisted in structured form in a local database so that it can be read at any time;

(3) The persisted microblog text is preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging. In view of the characteristics of microblog text, stop words are first filtered using pattern matching. Chinese word segmentation and part-of-speech tagging optimized for the microblog scenario are then performed using the segmenter product ICTCLAS5.0. Before keyword extraction and keyword expansion, the segmented user microblogs are additionally filtered by part of speech so that only nouns are retained. The preprocessing result is the user's historical microblog data.

In the domain keyword extraction module, the weight at each stage is calculated iteratively following the idea of the PageRank algorithm, according to the following formula:

Where: V_i is the i-th node; TR(V_i) is the weight of node V_i; w_ij is the weight of the edge between nodes V_i and V_j; E(V_i) is the set of edges connected to V_i; d is the damping coefficient of the iteration and is set to 0.85. The iteration can start from arbitrary initial values and continues until convergence. The convergence condition is that the sum of the absolute differences of the node weights between the current iteration and the previous iteration is less than a specified value.
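The iteration described above can be sketched as follows. This is a minimal illustration, not the patent's own code: it assumes the standard weighted TextRank update, in which each neighbour contributes its score in proportion to the weight of the shared edge, and it reads the convergence condition as the sum of per-node absolute changes. The function name `textrank` and the parameters `tol` and `max_iter` are hypothetical.

```python
def textrank(graph, d=0.85, tol=1e-6, max_iter=200):
    """Iteratively score nodes of an undirected weighted graph.

    graph: dict mapping node -> {neighbour: edge weight}, symmetric.
    d: damping coefficient (0.85 as stated in the text).
    tol, max_iter: assumed stopping parameters, not given in the source.
    """
    scores = {node: 1.0 for node in graph}  # arbitrary initial values
    # Total edge weight at each node, used to normalise its contribution.
    total_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node, nbrs in graph.items():
            rank = sum(
                w / total_weight[nbr] * scores[nbr]
                for nbr, w in nbrs.items()
                if total_weight[nbr] > 0
            )
            new_scores[node] = (1 - d) + d * rank
        # Convergence test: sum of absolute per-node changes.
        change = sum(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if change < tol:
            break
    return scores


if __name__ == "__main__":
    toy = {"a": {"b": 2, "c": 1}, "b": {"a": 2}, "c": {"a": 1}}
    ranked = sorted(textrank(toy).items(), key=lambda kv: kv[1], reverse=True)
    print(ranked)
```

Sorting the returned scores in descending order gives the keyword ranking that the module uses as its extraction result.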

In the user-defined keyword expansion module, the similarity between keywords is calculated using the improved P-IOW algorithm, implemented as follows:

Given a user-defined keyword s, the domain relevance of a word t with respect to s is calculated as follows:

Where:

Where: s is the user-defined keyword; wf(t) is the number of microblogs containing the word t; t_M is the word contained in the largest number of microblogs in the domain-related corpus, and the formula uses the number of microblogs that contain t_M; the formula also uses the number of microblogs that do not contain the word s; wf(t∧s) is the number of microblogs containing both the word t and the word s; N is the total number of the user's microblogs; s_p is a smoothing factor with range (0,1) that prevents a zero divisor when the corresponding count is zero. It follows from the formulas that, before formula (1) is multiplied by the down-weighting factor, the value of P-IOLogW lies between log(s_p) and log(1/s_p). The larger the resulting P-IOW value, the higher the domain similarity between the word t and the user-defined keyword s. The user-defined keyword itself is assigned the upper bound of the P-IOLogW range, namely log(1/s_p).
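The counts that these definitions refer to can be gathered in one pass over the user's microblogs. The sketch below only collects the inputs (N, wf(t), wf(t∧s), t_M, and the number of microblogs without s); it does not implement the P-IOW score itself, and the function name `microblog_counts` and its return layout are hypothetical.

```python
def microblog_counts(microblogs, s):
    """Collect the document counts used by the P-IOW definitions.

    microblogs: list of token lists, one per microblog.
    s: the user-defined keyword.
    """
    n = len(microblogs)          # N: total number of microblogs
    wf = {}                      # wf[t]: microblogs containing t
    wf_with_s = {}               # wf_with_s[t]: microblogs containing t and s
    for tokens in microblogs:
        unique = set(tokens)     # count each word once per microblog
        for t in unique:
            wf[t] = wf.get(t, 0) + 1
            if s in unique:
                wf_with_s[t] = wf_with_s.get(t, 0) + 1
    # t_M: the word contained in the largest number of microblogs.
    t_m = max(wf, key=wf.get) if wf else None
    without_s = n - wf.get(s, 0)  # microblogs that do not contain s
    return n, wf, wf_with_s, t_m, without_s
```

Counting each word once per microblog matches the definition of wf(t) as a number of microblogs rather than a number of occurrences.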

In the user-defined keyword expansion module, the expansion word vectors obtained by keyword expansion are linearly summed to obtain the final expansion vector. The expansion algorithm is as follows:

(1) For each user-defined keyword in the keyword set, first calculate the weights of all of its expansion words based on P-IOW;

(2) Map the expansion-word weight vector of the keyword onto the word space of the domain-related corpus to form an expansion word vector;

(3) Linearly superpose all expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.
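The three steps above can be sketched as follows. The scoring function is passed in as `weight_fn`, a hypothetical stand-in for the P-IOW calculation, and the names `expand_keywords` and `vocabulary` are likewise assumptions made for illustration.

```python
def expand_keywords(keywords, vocabulary, weight_fn):
    """Sum the expansion vectors of several user-defined keywords.

    keywords: iterable of user-defined keywords.
    vocabulary: words of the domain-related corpus (the vector space).
    weight_fn: callable keyword -> {word: weight}; stands in for P-IOW.
    """
    total = {t: 0.0 for t in vocabulary}
    for s in keywords:
        weights = weight_fn(s)                 # step (1): expansion weights
        for t in vocabulary:                   # step (2): map onto vocabulary
            total[t] += weights.get(t, 0.0)    # step (3): linear superposition
    return total


if __name__ == "__main__":
    toy_weights = {
        "cloud": {"cloud": 2.0, "server": 1.5},
        "storage": {"storage": 2.0, "server": 0.5},
    }
    vocab = ["cloud", "server", "storage", "phone"]
    print(expand_keywords(["cloud", "storage"], vocab, toy_weights.get))
```

Words outside a keyword's expansion simply contribute zero, so every keyword's vector lives in the same space and the sum is well defined.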

The linear merging module is implemented as follows:

(1) The vectors generated by keyword extraction and keyword expansion are each normalized using maximum-value normalization;

(2) For each component of a vector, the normalization is as follows:

v_normal=v/v_max

Where: v is the original value of a component of the vector and v_max is the maximum over all components of the vector. After maximum-value normalization, all components of the vector lie in (0,1] and are non-zero. The vectors are then merged by linear weighting:

V_combine=r × V_kw-extract+(1-r)×V_kw-expand

Where: r is the user-defined merging weight ratio, V_kw-extract is the result vector of keyword extraction, V_kw-expand is the result vector generated by keyword expansion, and the merged result V_combine is the keyword vector describing the user's domain interest.
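The normalization and merge formulas above translate directly into code. The sketch below represents vectors as word-to-weight dictionaries, which is an assumption about the data layout, and it treats a word missing from one vector as having weight zero there; the function names are hypothetical.

```python
def max_normalize(vec):
    """Divide every component by the largest one: v_normal = v / v_max."""
    v_max = max(vec.values())  # assumes a non-empty vector with v_max > 0
    return {t: v / v_max for t, v in vec.items()}


def linear_merge(v_extract, v_expand, r):
    """V_combine = r * V_kw-extract + (1 - r) * V_kw-expand after normalization."""
    a = max_normalize(v_extract)
    b = max_normalize(v_expand)
    return {
        t: r * a.get(t, 0.0) + (1 - r) * b.get(t, 0.0)
        for t in set(a) | set(b)
    }


if __name__ == "__main__":
    extract = {"cloud": 2.0, "server": 1.0}
    expand = {"cloud": 4.0, "storage": 2.0}
    print(linear_merge(extract, expand, r=0.5))
```

Setting r close to 1 favours the automatically extracted keywords, while r close to 0 favours the user's own keywords and their expansions.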

In the similarity calculation and personalized recommendation module, the domain relevance is calculated by the following formula:

Where: V_combine is the keyword vector of the user's domain interest; W is the word-frequency vector of the microblog T to be recommended; V_IDF is the IDF vector of the words; L is the total number of dimensions of the vector space after segmentation of the user's historical microblogs; the formula uses the components of V_combine and of W that correspond to the word t_i; IDF(t_i) is the IDF value of t_i.
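One plausible reading of these definitions is a three-way dot product: for every word t_i, multiply its component in V_combine, its frequency in the microblog, and its IDF value, then sum over all words. The sketch below follows that reading as an assumption; the IDF values are taken as given input because the text does not state how they are computed, and the function names are hypothetical.

```python
from collections import Counter


def domain_relevance(v_combine, tokens, idf):
    """Sum over words of interest weight * term frequency * IDF.

    v_combine: {word: interest weight}, the merged keyword vector.
    tokens: segmented words of one candidate microblog.
    idf: {word: IDF value}, assumed to be precomputed.
    """
    w = Counter(tokens)  # word-frequency vector W of the microblog
    return sum(
        v_combine.get(t, 0.0) * tf * idf.get(t, 0.0) for t, tf in w.items()
    )


def recommend(v_combine, candidates, idf):
    """Sort candidate microblogs by domain relevance, highest first."""
    return sorted(
        candidates,
        key=lambda toks: domain_relevance(v_combine, toks, idf),
        reverse=True,
    )


if __name__ == "__main__":
    interest = {"cloud": 1.0, "server": 0.5}
    idf = {"cloud": 2.0, "server": 1.0, "lunch": 3.0}
    posts = [["server"], ["cloud", "cloud", "lunch"], ["lunch"]]
    print(recommend(interest, posts, idf))
```

Words that carry no interest weight contribute nothing, so off-topic microblogs fall to the bottom of the ranking regardless of their length.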

A domain information recommendation method based on a microblog platform is divided into six steps: data acquisition and preprocessing, domain keyword extraction, user-defined keyword expansion, linear merging, similarity calculation and personalized recommendation, and topic acquisition. It is implemented as follows:

(1) The microblog information related to the user is obtained and preprocessed. Preprocessing filters stop words using pattern matching and performs word segmentation and part-of-speech tagging using the word segmentation system ICTCLAS5.0. The preprocessing result is the user's historical microblog data.

(2) Domain keyword extraction is performed on the preprocessing result, without supervision, using the TextRank for Weibo algorithm that the invention derives from the TextRank algorithm. The process has two stages: construction of an undirected graph based on co-occurrence relations, and graph-based calculation of node weights. The graph construction stage first converts the segmented words that appear in the user's historical microblogs into corresponding nodes. Then, for each pair of words that appears in the segmentation result of a microblog, an edge is constructed whose weight equals the number of times the pair co-occurs in the historical microblogs;

The graph-based node weight calculation stage iteratively calculates the weights following the idea of the PageRank algorithm until the change in node weights converges below a certain threshold;

(3) If the user has defined domain-interest keywords, the preprocessing result is also passed to the keyword expansion module. Keyword expansion is performed using the P-IOW algorithm, an improvement of the P-IOLog algorithm: the similarity between keywords is calculated based on information such as the co-occurrence and distribution of the keywords and the attributes of the users they belong to, and the words with high relevance are taken as the expansion result of the target keyword. Input of multiple user-defined keywords is supported; the keyword expansion module linearly sums the expansion word vectors obtained for the individual user-defined keywords to obtain the final expansion vector;

(4) The two results are merged according to user-defined weights. The two result vectors are first normalized using maximum-value normalization so that they are mapped into a unified value range. The two normalized vectors are then linearly merged, and the merging process supports user-defined weights for keyword extraction and keyword expansion. The merged result is used by the recommendation module for relevance comparison in order to generate a relevance score for each microblog to be recommended;

(5) The microblogs to be recommended, to which the user subscribes, are segmented and vectorized according to word frequency. Then, according to the identified user interest, the user-interest keyword vector, the word-frequency vector generated from the microblog to be recommended, and the IDF information vector are combined by a dot product, so that the domain relevance is calculated by a vector-space dot product;

(6) The set of domain microblog texts recommended to the user is taken as input, and term clustering is performed based on LDA to complete topic discovery. The relevance between each topic's term set and the user domain-interest keyword terms obtained in the linear merging module is then calculated to establish topic importance, and topics are recommended to the user according to topic importance.
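The topic ranking in step (6) can be illustrated once the LDA topics are available. The sketch below assumes each topic is already given as a term-to-probability mapping (training the LDA model is out of scope here) and scores a topic by summing term probability times interest weight. That scoring rule is an assumption, since the text does not specify how the relevance between a topic's term set and the interest keywords is computed.

```python
def rank_topics(topics, v_combine):
    """Order topics by an assumed importance score.

    topics: {topic_id: {term: probability}}, e.g. top terms of LDA topics.
    v_combine: {word: interest weight}, the merged keyword vector.
    Returns a list of (topic_id, importance), most important first.
    """
    importance = {
        topic_id: sum(p * v_combine.get(term, 0.0) for term, p in terms.items())
        for topic_id, terms in topics.items()
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    toy_topics = {
        "t0": {"cloud": 0.5, "server": 0.5},
        "t1": {"lunch": 0.5, "cloud": 0.25},
    }
    print(rank_topics(toy_topics, {"cloud": 1.0, "server": 0.5}))
```

Topics dominated by terms outside the user's interest vector receive low importance, which is how domain-irrelevant topics are pushed down the list.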

Compared with the prior art, the advantages of the invention are:

(1) Regarding the source of the domain corpus: keyword extraction and keyword expansion each have their own problems, and a single strategy cannot adequately meet the demand for recommending domain-related information on microblog platforms. To address this, the invention models the user's domain interest from the domain corpus by fusing graph-based keyword extraction with user-defined keyword expansion based on co-occurrence information. This ensures that the user's dynamic interest demands can be met in real time and greatly enhances the expressive power of user-defined keywords. The invention thereby achieves both a comprehensive description of domain information and responsiveness to dynamic changes in user interest, and resolves the conflict between precision and recall that domain users face when obtaining domain-related information on microblog platforms.

(2) For the sake of practicality, the algorithm of the invention does not depend on external corpora and does not need to analyze a large-scale corpus. It computes on the user's local data, namely the user's historical microblog data, so the algorithm responds faster and is more practical and portable.

(3) Topic modeling is additionally performed on the domain microblog texts recommended to the user, which filters out interference from domain-irrelevant topics, yields more accurate topic recommendation, and further improves the accuracy of domain information recommendation.

Brief description of the drawings

Fig. 1 is the system architecture diagram of the invention;

Fig. 2 is the overall system flow chart of the invention;

Fig. 3 is the system sequence diagram of the invention;

Fig. 4 is the system framework diagram of the invention;

Fig. 5 is an excerpt of the main parts of the UML diagram of the graph-based keyword extraction submodule of the invention;

Fig. 6 is an excerpt of the main parts of the UML diagram of the keyword expansion submodule of the invention based on co-occurrence information.

Detailed description of the embodiments

The invention is described in detail below with reference to specific embodiments and the drawings.

As shown in Fig. 1, the domain information recommendation system based on a microblog platform of the invention comprises: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module.

Data acquisition and preprocessing module: for enterprise microblog users on the Sina Weibo platform, the historical microblog texts of relevant microblog accounts in the domains the user specifies are analyzed. The Sina Weibo open platform is used to obtain the user's relevant historical microblog information, and data preprocessing is completed at the same time: stop words are filtered using pattern matching, and word segmentation and part-of-speech tagging are performed using the word segmentation system ICTCLAS5.0. The preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain-interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module.

Domain keyword extraction module: based on the preprocessing result, an undirected graph based on co-occurrence relations is constructed first. The segmented words that appear in the user's historical microblogs are converted into corresponding nodes, and for each pair of words that appears in the segmentation result of a microblog an edge is constructed whose weight equals the number of times the pair co-occurs in the historical microblogs. Graph-based node weight calculation is then performed: the weights are calculated iteratively following the idea of the PageRank algorithm until the change in node weights converges below a certain threshold. After the iteration ends, the weight of each node is the importance of the word it represents. Sorting all of the user's words by importance yields the keyword extraction result, which automatically identifies the features of the user's domain.

User-defined keyword expansion module: using the improved P-IOW (Probabilistic Inside-Outside Log for Weibo) method, the similarity between keywords is calculated based on information such as the co-occurrence and distribution of the keywords and the attributes of the users they belong to, and the words with high relevance are taken as the expansion result of the target keyword. The module supports input of multiple user-defined keywords; the keyword expansion module linearly sums the expansion word vectors obtained for the individual user-defined keywords to obtain the final expansion vector. User-defined keyword expansion ensures that the user's dynamic interest demands can be met in real time and greatly enhances the expressive power of user-defined keywords.

Linear merging module: after automatic extraction of domain keywords and expansion based on user-defined keywords are complete, the two result vectors are normalized using maximum-value normalization so that they are mapped into a unified value range. The two normalized vectors are then linearly merged, and the merging process supports user-defined weights for keyword extraction and keyword expansion. The module outputs a word vector that represents the user's final domain interest.

Relevance calculation and personalized recommendation module: after the keyword vector describing the user's domain interest has been generated, this module segments each microblog to be filtered and counts its word frequencies to generate a word-frequency vector. The user-interest keyword vector, the word-frequency vector generated from the microblog to be recommended, and the IDF information vector are then combined by a dot product to obtain the relevance between the microblog and the user's interest. The domain relevance of each of the user's microblogs is calculated and the microblogs are sorted by domain relevance from high to low, realizing personalized domain microblog recommendation for the user.

As shown in Fig. 2, the data acquisition and preprocessing module is implemented as follows:

(1) After the user logs in to the microblog system, the user is first verified. Once verification passes, the system automatically uses the Sina Weibo open platform OAuth2.0 credential associated with the user to interact with the open platform and verify the legitimacy of the user's identity on the Sina Weibo platform. If the credential has not expired, user verification is complete. If the credential does not exist or has expired, the system automatically redirects to the OAuth2.0 verification page of the open platform, which asks the user to enter the user name and password of their Sina Weibo account. After the user enters the correct information, the open platform returns the updated credential to the system. The system persists this credential so that, within the credential's validity period, the user needs only the system's own user name and password to log in to the open platform, obtain microblog data, and perform related operations.

(2) The microblog text data obtained is persisted in structured form in a local database so that the upper-layer analysis modules can read it at any time. For data updates, this module supports incremental updating: each update transfers only the relevant microblog information that the user has newly added, which improves the response speed of the system and saves as much network bandwidth as possible.

(3) Next, the persisted microblog text is preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging. Microblog stop words mainly take the following forms: topic labels of the form "#topic#", directed mentions of a user of the form "@username", emoticons of the form "[emoticon word]", and URL links contained in the microblog. In view of these characteristics of microblog text, the stop words are first filtered out by pattern matching. Chinese word segmentation and part-of-speech tagging, optimized for the microblog scenario, are then performed. Word segmentation is a text-processing step specific to a small number of languages such as Chinese, because Chinese text, unlike most other languages, has no explicit word separators. In natural language processing, word segmentation is an unavoidable operation for converting text into a form a computer can process. Because the present invention adopts the vector space model, word segmentation is all the more indispensable, and the quality of the segmentation result directly affects the quality of the algorithm's results. Part-of-speech tagging assigns the correct lexical category to each word in a given sentence and is a very useful preprocessing step for subsequent natural language processing. This module uses the segmenter product ICTCLAS5.0 for word segmentation and part-of-speech tagging; because the imported new-word corpus also carries part-of-speech information, it does not reduce tagging accuracy. An investigation of domain-related microblogs on the Sina Weibo platform found that nouns portray a user's domain interest accurately, whereas words of other parts of speech, when used as keywords, tend to introduce ambiguity and lower recommendation accuracy. Therefore, before keyword extraction and keyword expansion, the segmented user microblogs are filtered by part of speech and only nouns are retained. The preprocessing result is the user's historical microblog data.
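The stop-word filtering and noun filtering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expressions are assumptions derived from the four stop-word forms named in the text, the function names are hypothetical, and the noun filter assumes a tag set in which noun tags begin with "n" (as in ICTCLAS-style tag sets). Segmentation itself is left to an external segmenter.

```python
import re

# Assumed patterns for the four stop-word forms named in the text.
STOP_PATTERNS = [
    re.compile(r"https?://\S+"),    # URL link
    re.compile(r"#[^#]+#"),         # topic label: #topic#
    re.compile(r"@[\w\-]+"),        # directed mention: @username
    re.compile(r"\[[^\[\]]+\]"),    # emoticon: [emoticon word]
]


def strip_stop_patterns(text: str) -> str:
    """Remove microblog-specific stop patterns, then collapse whitespace."""
    for pattern in STOP_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def keep_nouns(tagged_words):
    """Keep words whose part-of-speech tag starts with 'n' (assumed noun tags)."""
    return [word for word, tag in tagged_words if tag.startswith("n")]


if __name__ == "__main__":
    print(strip_stop_patterns("#AI# great phone @alice [smile] http://t.cn/abc now"))
    print(keep_nouns([("phone", "n"), ("great", "a"), ("Beijing", "ns")]))
```

URLs are removed first so that a "#" fragment inside a link is not mistaken for a topic label.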

As shown in Fig. 2, the domain keyword extraction module is implemented as follows:

(1) present invention compares from TextRank algorithm relatively advanced at present in terms of keyword extraction by experiment, Reason is as follows：TextRank algorithm overcomes the defect of TFIDF methods, and it need not calculate TF information, therefore need not be by microblogging Merge, and it is independent of exterior I DF information.TextRank considers common with domain correlation degree keyword high when calculating Existing word is with this hypothesis of domain correlation degree higher so that the calculating of keyword weight is no longer linear calculating, so that The power-law distribution problem of TFMF algorithms is overcome to a certain extent.Therefore its relative other algorithm is more suitable for this application scene.

(2) This module draws on the ideas of the TextRank keyword extraction algorithm and, taking the characteristics of the microblog application scenario into account, proposes the TextRank for Weibo algorithm. TextRank for Weibo is a graph-based keyword extraction algorithm inspired by the PageRank algorithm. In constructing the graph, traditional TextRank defines the weight of the edge between two words by their number of co-occurrences within a fixed-length sliding window over a document. Given that microblogs are short texts, the present invention instead uses the number of co-occurrences of two words within a single microblog as the weight of the edge between them. The PageRank algorithm is then run on this undirected graph to compute each word's weight as a keyword. In the present invention each node of the graph represents one segmented word; whenever two words co-occur in one of the user's microblogs, the weight of the edge between their nodes is increased by 1, so the final weight of an edge is the number of times its two words co-occur in the microblogs.

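After the undirected graph has been determined, the weight of each node is produced iteratively using an algorithm similar to PageRank. The weight of node V_i is updated according to the following formula:

TR(V_i) = (1 - d) + d \times \sum_{V_j \in E(V_i)} \frac{w_{ij}}{\sum_{V_k \in E(V_j)} w_{jk}} TR(V_j)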

Fig. 5 shows selected main parts of the UML class diagram of this module, which is divided into two stages: construction of the undirected graph based on co-occurrence relations, and graph-based computation of node weights.

In the construction stage of the undirected graph based on co-occurrence relations, the segmented words appearing in the user's historical microblogs are first converted into corresponding nodes. Then, for the segmentation result of each microblog, edges are constructed for all word pairs that occur in it; the weight of an edge equals the number of co-occurrences of the pair in the historical microblogs.
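The two stages of TextRank for Weibo can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: each microblog is assumed to be an already segmented, noun-filtered word list; a word pair is counted once per microblog; and convergence is tested on the sum of per-node absolute changes, which is one possible reading of the stopping rule. The function names and the `tol` and `max_iter` parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(microblogs):
    """Stage 1: undirected graph whose edge weight is the number of
    microblogs in which the two words co-occur."""
    graph = defaultdict(lambda: defaultdict(float))
    for words in microblogs:
        for a, b in combinations(sorted(set(words)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def textrank(graph, d=0.85, tol=1e-6, max_iter=200):
    """Stage 2: iterate TR(V_i) = (1 - d) + d * sum_j w_ij / sum_k w_jk * TR(V_j)."""
    scores = {node: 1.0 for node in graph}
    # Total edge weight attached to each node (the inner sum over w_jk).
    strength = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            rank = sum(w / strength[nbr] * scores[nbr]
                       for nbr, w in graph[node].items())
            new_scores[node] = (1 - d) + d * rank
        delta = sum(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:  # assumed convergence test
            break
    return scores


if __name__ == "__main__":
    blogs = [["phone", "camera"], ["phone", "battery"],
             ["phone", "camera", "battery"]]
    ranked = sorted(textrank(build_cooccurrence_graph(blogs)).items(),
                    key=lambda kv: -kv[1])
    print(ranked)
```

Words that never co-occur with another word have no edges and therefore do not appear in the graph; how such words should be scored is left open here.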

As shown in Fig. 2, the user-defined keyword expansion module is implemented as follows:

(1) Because keyword extraction alone is insufficiently dynamic, the present invention introduces user-defined keywords to make interest modeling more dynamic. At the same time, to address the limited expressive power of user-defined keywords, the present invention proposes modeling user interest by combining keyword extraction with user-defined keyword expansion.

(2) The application scenario of the present invention requires that, against the background of a domain-related corpus, a few user-defined domain-related keywords be expanded on the basis of domain relevance. The present invention therefore uses word co-occurrence information in the domain-related corpus to compute the similarity between words. This approach rests on the assumption that a word co-occurring with a user-defined keyword in the same microblog has a strong domain similarity to that keyword.

(3) The present invention draws on P-IOLog, an expansion method for topic labels. The confidence of the expansion-word weights that this method generates is proportional to the size of the sample space, that is, to the frequency of the user-defined keyword in the corpus under analysis; when few microblogs contain the user-defined keyword, the expansion results of P-IOLog have a large error. The present invention therefore improves P-IOLog for this application scenario by introducing a down-weighting factor that increases as the number of occurrences of the user-defined keyword increases; it also removes the original method's consideration of the topic level. The result is the Probabilistic Inside-Outside Log for Weibo (P-IOW) method.

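(4) Specifically, for a given user-defined keyword s, the domain relevance of a word t with respect to s is computed as follows:

P\text{-}IOW(t, s) = \begin{cases} \dfrac{\log(wf(s) + 1)}{\log(WF_{t_M} + 1)} \times P\text{-}IOLogW(t, s), & t \neq s \\ P\text{-}IOLogW(t, s), & t = s \end{cases} \qquad (1)

where wf(s) is the number of microblogs containing s; WF_{t_M} is the number of microblogs containing t_M, the word contained in the most microblogs of the domain-related corpus; and P-IOLogW(t, s) is the score of t with respect to s before the down-weighting factor is applied.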

(5) The present invention also supports input of multiple user-defined keywords. The keyword expansion module linearly sums the expansion word vectors obtained for each user-defined keyword to produce the final expansion vector. The expansion process is as follows:

1) For each keyword in the set of user-defined keywords, first compute the weights of all expansion words of the keyword based on P-IOW;

2) Map the expansion-word weight vector of the keyword onto the word space of the domain-related corpus to form an expansion word vector;

3) Linearly superpose all expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.
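The three expansion steps, together with the down-weighting factor of P-IOW, can be sketched as follows. This is only a sketch under assumptions: the underlying P-IOLogW score is not reproduced here, so it is passed in as a caller-supplied function (the hypothetical parameter `p_iolog_w`); `wf` is assumed to map each word to the number of microblogs containing it, with at least one word occurring in at least one microblog.

```python
import math


def p_iow(t, s, wf, wf_max, p_iolog_w):
    """P-IOW(t, s): the supplied P-IOLogW score, scaled by the down-weighting
    factor log(wf(s) + 1) / log(WF_tM + 1) when t differs from s."""
    base = p_iolog_w(t, s)
    if t == s:
        return base
    return math.log(wf.get(s, 0) + 1) / math.log(wf_max + 1) * base


def expand_keywords(keywords, vocabulary, wf, p_iolog_w):
    """Steps 1 to 3: score every vocabulary word against each user-defined
    keyword, then linearly superpose the per-keyword expansion vectors."""
    wf_max = max(wf.values())  # microblog count of the most widespread word
    expansion = {t: 0.0 for t in vocabulary}
    for s in keywords:
        for t in vocabulary:
            expansion[t] += p_iow(t, s, wf, wf_max, p_iolog_w)
    return expansion


if __name__ == "__main__":
    wf = {"phone": 9, "camera": 3, "battery": 1}
    constant_score = lambda t, s: 1.0  # placeholder scorer for illustration
    print(expand_keywords(["camera"], list(wf), wf, constant_score))
```

With the placeholder scorer, the output shows only the effect of the down-weighting factor: a keyword found in few microblogs contributes less weight to its expansion words than one found in many.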

Fig. 6 shows selected main parts of the UML class diagram of this module. The user-defined keywords are expanded on the basis of the co-occurrence information between other segmented words and the user-defined keywords, and the expansion result is represented as a vector so that it can be merged seamlessly with the keyword extraction result.

As shown in Fig. 2, the linear merging module is implemented as follows:

(1) Linear merging according to user-defined weights is the process of merging the two result vectors once both automatic keyword extraction and expansion based on user-defined keywords have completed.

The traditional methods of modeling user interest by keyword extraction and of representing user interest by user-defined keywords each have their own defects. For keyword extraction, the main defects are that the user cannot manually adjust the algorithm's results, so a user's temporary, dynamic need for a particular domain topic cannot be met, and that the algorithm cannot model domain interest when the user has few historical microblogs, the so-called cold-start phenomenon. For user-defined keywords, the main defects are: first, user-defined keywords can hardly be guaranteed to cover all information of the domain; second, in short microblog texts, judging similarity solely by whether a keyword appears means that much information expressed with synonymous but different words cannot be extracted.

In view of the defects of these two methods, the present invention proposes fusing automatic domain keyword extraction with user-defined keywords, and expanding the user-defined keywords on the basis of word co-occurrence information. This approach provides comprehensive recognition of domain information while allowing keywords to be adjusted dynamically at any time as the user's interest changes; in addition, user-defined keywords can alleviate the cold-start phenomenon to some extent.

(2) Because the result vectors of keyword extraction and keyword expansion do not lie in the same value range, the two vectors must be mapped into the same value range. This module normalizes both result vectors by maximum normalization, mapping them into a unified value range. The two normalized vectors are then merged linearly, with user-defined weights for keyword extraction and keyword expansion. The module outputs a word vector representing the user's final domain interest.

For each component of a vector, the normalization is as follows:

v_normal=v/v_max

where v is the original value of a component of the vector and v_max is the maximum over all components of the vector. After maximum normalization, all components of the vector lie within (0, 1] and all components are non-zero. The vectors are then merged by linear weighting:

V_combine=r × V_kw-extract+(1-r)×V_kw-expand

where r is the user-defined merging weight ratio, V_kw-extract is the result vector of keyword extraction, V_kw-expand is the result vector generated by keyword expansion, and the merged result V_combine is the keyword vector that portrays the user's domain interest.
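Maximum normalization followed by the weighted merge can be sketched as follows. The sketch assumes each vector is stored as a sparse mapping from word to non-negative weight, with absent words treated as 0; the function names are hypothetical.

```python
def max_normalize(vec):
    """v_normal = v / v_max for every component of a sparse word vector."""
    v_max = max(vec.values(), default=0.0)
    if v_max <= 0:
        return dict(vec)  # empty or all-zero vector: nothing to scale
    return {term: value / v_max for term, value in vec.items()}


def linear_merge(v_extract, v_expand, r):
    """V_combine = r * V_kw-extract + (1 - r) * V_kw-expand after normalization."""
    a, b = max_normalize(v_extract), max_normalize(v_expand)
    return {t: r * a.get(t, 0.0) + (1 - r) * b.get(t, 0.0)
            for t in set(a) | set(b)}


if __name__ == "__main__":
    extracted = {"phone": 2.0, "camera": 1.0}
    expanded = {"camera": 4.0, "lens": 2.0}
    print(linear_merge(extracted, expanded, r=0.5))
```

Setting r = 1 keeps only the extraction result and r = 0 keeps only the expansion result, which matches the case in which the user supplies no keywords and the expansion vector is empty.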

As shown in Fig. 2, the similarity calculation and personalized recommendation module is implemented as follows:

(1) After the keyword vector V_combine portraying the user's domain interest has been generated, the relevance comparison module segments each microblog to be filtered and counts word frequencies to generate a word-frequency vector. It then takes the dot product of the user-interest keyword vector, the word-frequency vector generated from the microblog to be recommended, and the IDF information vector to obtain the relevance between the microblog and the user's interest. The relevance is computed as follows:

Relevance_T = V_{combine} \cdot W \cdot V_{IDF} = \sum_{t_i \in L} v_{t_i} \, w_{t_i} \, IDF(t_i)

where W is the word-frequency vector of the microblog T to be recommended, V_IDF is the word IDF vector, and v_{t_i} and w_{t_i} are the components of V_combine and W corresponding to word t_i.

(2) Finally, the model computes the domain relevance of each of the user's microblogs, and the system recommends microblogs according to the relevance obtained. The form of recommendation is not limited: microblogs may be sorted by domain relevance from high to low, or only microblogs whose domain relevance exceeds a specified threshold may be retained, among other options. The present invention adopts the first form of presentation.
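The relevance computation and the two presentation forms can be sketched as follows. It is a minimal sketch assuming sparse dictionaries for the interest vector and the IDF values, with words missing from either treated as contributing 0; `relevance`, `recommend`, and the optional `threshold` parameter are hypothetical names.

```python
from collections import Counter


def relevance(interest, words, idf):
    """Sum over words of interest weight * word frequency * IDF value."""
    tf = Counter(words)
    return sum(interest.get(t, 0.0) * count * idf.get(t, 0.0)
               for t, count in tf.items())


def recommend(interest, microblogs, idf, threshold=None):
    """Return (index, relevance) pairs sorted from high to low relevance,
    optionally keeping only microblogs whose relevance exceeds threshold."""
    scored = [(relevance(interest, words, idf), i)
              for i, words in enumerate(microblogs)]
    if threshold is not None:
        scored = [(s, i) for s, i in scored if s > threshold]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, s) for s, i in scored]


if __name__ == "__main__":
    interest = {"phone": 1.0, "camera": 0.5}
    idf = {"phone": 1.0, "camera": 2.0, "cat": 3.0}
    blogs = [["cat", "cat"], ["phone", "camera"], ["camera", "camera", "phone"]]
    print(recommend(interest, blogs, idf))
```

Ties are broken by the original order of the microblogs, a choice made here only so that the output is deterministic.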

As shown in Fig. 2, the topic acquisition module is implemented as follows:

(1) The recommended domain microblogs undergo text preprocessing. As in module 1, preprocessing comprises three parts: stop-word filtering, word segmentation, and part-of-speech tagging.

(2) An LDA model is trained with the processed data as input, and terms are clustered into topics according to each topic's term distribution;

(3) The relevance between each topic term set and the user's domain-interest keywords obtained by the linear merging module is computed. The basis is to measure relevance using co-occurrence information between terms: given the user's domain-interest keywords, every topic term set containing such a keyword is regarded as relevant, and a topic containing no domain-interest keyword is regarded as having low or no relevance. The specific algorithm uses the relevance computation method of module 5.

(4) The relevance represents the importance of a topic; the topics are sorted by importance and presented to the user.
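Steps (3) and (4) can be sketched as follows, assuming the LDA model has already been trained and each topic is available as a mapping from term to its weight in that topic. Using the topic's term weight in place of the word frequency in the module 5 formula is an assumption made for this illustration, and the function names are hypothetical.

```python
def topic_importance(interest, topic_terms, idf):
    """Module-5 style score: interest weight * topic term weight * IDF."""
    return sum(interest.get(term, 0.0) * weight * idf.get(term, 0.0)
               for term, weight in topic_terms.items())


def rank_topics(interest, topics, idf):
    """Sort topics by importance, highest first."""
    scored = {name: topic_importance(interest, terms, idf)
              for name, terms in topics.items()}
    return sorted(scored.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    interest = {"phone": 1.0, "camera": 0.5}
    idf = {w: 1.0 for w in ["phone", "camera", "lens", "battery", "cat", "dog"]}
    topics = {
        "photo": {"camera": 0.6, "lens": 0.4},
        "pets": {"cat": 0.7, "dog": 0.3},
        "mobile": {"phone": 0.8, "battery": 0.2},
    }
    print(rank_topics(interest, topics, idf))
```

A topic that contains no domain-interest keyword scores 0 and therefore sorts last, consistent with step (3).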

The present invention implements a prototype system on the Sina Weibo open platform to verify the results. The sequence diagram of the system is shown in Fig. 3, which illustrates the overall workflow of the present invention:

1) After the user logs in to the system, username and password verification is performed first, and the system then calls the OAuth verification module of Sina Weibo to verify authorization of the microblog account; if the user has not bound a microblog account, one can be bound using OAuth.

2) After login and authorization verification have completed, the system updates the user's data and persists the user-related microblog information locally, so that subsequent analysis works on up-to-date data.

3) After the data has been persisted locally, the segmenter and part-of-speech filter segment and filter the user's domain-related microblogs in the database in preparation for subsequent keyword extraction and keyword expansion. Here the user's domain-related microblogs mainly refer to microblogs posted by the bound enterprise microblog account, or microblogs posted by domain-related users specified by the user; they may even be an external domain-related corpus. The present invention does not restrict the segmenter or part-of-speech tagger; in theory any Chinese segmenter and part-of-speech tagger may be used. The system corresponding to the present invention uses the ICTCLAS5.0 word segmentation and part-of-speech tagging system developed by the Institute of Computing Technology of the Chinese Academy of Sciences and, because that segmenter's dictionary is dated, introduces external dictionary data from the Jieba segmenter. The preprocessing result is the user's historical microblog data.

4) The domain-related microblogs, after segmentation and part-of-speech filtering, are passed to the keyword extraction module in the form of word frequencies. The system extracts keywords using the TextRank for Weibo method, and the extraction result represents the interest characteristics of the domain.

5) If the user has defined interest keywords, then to further enhance the recall capability of the user-defined keywords, the system expands them on the basis of domain information using the P-IOW keyword expansion method. The expansion result is expressed as a word vector. If the user has not defined any interest keywords, this step outputs an empty result.

6) After keyword extraction based on TextRank for Weibo and keyword expansion based on P-IOW have completed, their results are merged by linear weighting according to user-defined weights. The user can define the proportions given to the keyword extraction result and the keyword expansion result. The output is the merged word vector, which represents the user's final domain interest.

7) This step computes the similarity between the user's interest and the microblogs to be recommended. Each microblog to be recommended is first segmented and filtered by part of speech (as in step 3) and converted into a word-frequency vector. The dot product of the user's domain-interest vector, the vector converted from the microblog to be recommended, and the word IDF values is then computed, as discussed in detail above. The product is the similarity between the microblog to be recommended and the user's domain interest.

8) With the recommended domain microblog text as input, terms are clustered using the LDA model to form topic term sets; finally the importance of each topic term set is computed, and the topics are presented to the user in order of importance.

The architecture of the system implementing the present invention is shown in Fig. 4. The system is built on a MySQL + JSP + Servlet + Twitter Bootstrap technology stack and is divided into six parts: the data acquisition and preprocessing module, the domain keyword extraction module, the user-defined keyword expansion module, the linear merging module, the similarity calculation and personalized recommendation module, and the topic acquisition module. The target users of the present invention are enterprise microblog users on the Sina Weibo platform. By analyzing the historical microblog text of domain-related microblog accounts that the user specifies, the system can automatically identify the features of the user's domain. The system also lets the user dynamically define keywords to express the user's own changing domain needs, and it expands those keywords according to their co-occurrence information in the historical microblogs, strengthening their semantic expressiveness. The user can then use self-defined keyword weights to model a fully personalized domain interest. Finally, filtered recommendation of the user's subscribed microblogs and topic recommendation are realized according to the domain interest. This spares the user from checking large numbers of domain-irrelevant microblogs one by one and improves the working efficiency of enterprise marketing personnel.

Claims

1. A domain information recommendation system based on a microblog platform, characterized by comprising: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module; wherein:

Data acquisition and preprocessing module: obtains the user's relevant microblog data and preprocesses it; preprocessing comprises stop-word filtering, word segmentation, and part-of-speech tagging of the data; the preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain-interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module;

Domain keyword extraction module: based on the preprocessing result, keyword extraction is performed without supervision using the TextRank for Weibo algorithm, a modification of the TextRank algorithm; the algorithm comprises two stages, construction of an undirected graph based on co-occurrence relations and graph-based computation of node weights; in the construction stage, the segmented words appearing in the user's historical microblogs are first converted into corresponding nodes; when edges between nodes are constructed, whether an edge exists between two nodes, and its weight, are determined by the number of times the two words co-occur in the same microblog; the weight of an edge is the number of co-occurrences of the words within the same microblog: whenever two words co-occur in one of the user's microblogs, the weight of the edge between their nodes is increased by 1, and the final weight of the edge is the number of co-occurrences of its two words in the microblogs; then, in the graph-based node weight computation stage, the weight of each node is computed iteratively until the change in node weights converges below a certain threshold; after the iteration ends, the weight of each node is the importance of the word it represents, and all of the user's words are sorted by weight to obtain the keyword extraction result, thereby automatically identifying the features of the user's domain;

User-defined keyword expansion module: computes the similarity between keywords on the basis of keyword co-occurrence, distribution, and the attribute information of the users they belong to, and takes words of high relevance as the expansion result of the target keyword; this module supports input of multiple user-defined keywords and, for each user-defined keyword, linearly sums the expansion word vectors obtained by keyword expansion to obtain the final expansion vector; the user-defined keyword expansion function ensures that the user's dynamic interest needs can be met in real time while greatly enhancing the expressive power of the user-defined keywords;

Linear merging module: after both automatic domain keyword extraction and expansion based on user-defined keywords have completed, the two result vectors are normalized by maximum normalization so that the result vectors of keyword extraction and keyword expansion are mapped into a unified value range; the two normalized vectors are then merged linearly, with user-defined weights for keyword extraction and keyword expansion; the module outputs a word vector representing the user's final domain interest;

Relevance calculation and personalized recommendation module: after the linear merging module has produced the keyword vector portraying the user's domain interest, each microblog to be filtered is segmented and its word frequencies are counted to generate a word-frequency vector; the dot product of the user-interest keyword vector, the word-frequency vector generated from the microblog to be recommended, and the IDF information vector is then computed to obtain the relevance between the microblog and the user's interest, this relevance being the domain relevance of the microblog; the domain relevance of each of the user's microblogs is computed, the microblogs are sorted by domain relevance from high to low, and the microblog information is presented to the user, realizing personalized domain microblog recommendation for the user;

Topic acquisition module: an LDA model is trained with the domain microblog text recommended to the user as input, and terms are clustered into topics according to each topic's term distribution; the relevance between each topic term set and the user's domain-interest keywords obtained by the linear merging module is computed to obtain topic importance, and the topics are presented to the user in order of importance, completing topic discovery and recommendation.

2. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that the data acquisition and preprocessing module is implemented as follows:

(1) after the user logs in to the system, user verification is performed first; once verification succeeds, the microblog platform credentials associated with the user are used automatically to interact with the microblog platform and verify the legitimacy of the user's identity on the microblog platform;

(2) the relevant microblog text that the user follows or subscribes to is obtained, and the obtained data is persisted in structured form in a local database so that it can be read at any time;

(3) the persisted microblog text is preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging; in view of the characteristics of microblog text, stop words are first filtered out by pattern matching, and Chinese word segmentation and part-of-speech tagging optimized for the microblog scenario are then performed using the segmenter product ICTCLAS5.0; before keyword extraction and keyword expansion, the segmented user microblogs are filtered by part of speech and only nouns are retained. The preprocessing result is the user's historical microblog data.

3. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that: in the domain keyword extraction module, based on the historical microblog data, the weight of each node is computed iteratively following the idea of the PageRank algorithm, with the formula as follows:

TR(V_i) = (1 - d) + d \times \sum_{V_j \in E(V_i)} \frac{w_{ij}}{\sum_{V_k \in E(V_j)} w_{jk}} TR(V_j)

where V_i is the i-th node; TR(V_i) is the weight of node V_i; w_ij is the weight of the edge between nodes V_i and V_j; E(V_i) is the set of nodes connected to V_i by an edge; and d is the damping coefficient of the iteration, set to 0.85. The iteration may start from arbitrary initial values and continues until convergence; the convergence condition is that the absolute difference between the sums of the node weights in the current and previous iterations is less than a specified value.

4. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that: in the user-defined keyword expansion module, the similarity between keywords is computed using the improved P-IOW algorithm, implemented as follows:

P\text{-}IOW(t, s) = \begin{cases} \dfrac{\log(wf(s) + 1)}{\log(WF_{t_M} + 1)} \times P\text{-}IOLogW(t, s), & t \neq s \\ P\text{-}IOLogW(t, s), & t = s \end{cases} \qquad (1)

Wherein：

where s is the user-defined keyword; wf(t) is the number of microblogs containing word t; t_M is the word contained in the most microblogs of the domain-related corpus, and WF_{t_M} is the number of microblogs containing t_M; the number of microblogs not containing word s is also used; wf(t ∧ s) is the number of microblogs containing both word t and word s; N is the total number of the user's microblogs; and s_p is a smoothing factor with value in (0, 1), which prevents division by zero when a count is zero. From the formula it follows that, before formula (1) applies the down-weighting factor, the value of P-IOLogW lies between log(s_p) and log(1/s_p). The larger the resulting P-IOW value, the higher the domain similarity between word t and the user-defined keyword s. The user-defined keyword itself is assigned the upper bound of the P-IOLogW range, namely log(1/s_p).

5. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that: in the user-defined keyword expansion module, when the expansion word vectors obtained by keyword expansion are linearly summed to obtain the final expansion vector, the expansion process is as follows:

(1) for each keyword in the set of user-defined keywords, first compute the weights of all expansion words of the keyword based on P-IOW;

(2) map the expansion-word weight vector of the keyword onto the word space of the domain-related corpus to form an expansion word vector;

(3) linearly superpose all expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.

6. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that the linear merging module is implemented as follows:

(1) the vectors generated by keyword extraction and keyword expansion are each normalized by maximum normalization;

v_normal=v/v_max

where v is the original value of a component of the vector and v_max is the maximum over all components of the vector; after maximum normalization, all components of the vector lie within (0, 1] and all components are non-zero; the vectors are then merged by linear weighting:

V_combine=r × V_kw-extract+(1-r)×V_kw-expand

where r is the user-defined merging weight ratio, V_kw-extract is the result vector of keyword extraction, V_kw-expand is the result vector generated by keyword expansion, and the merged result V_combine is the keyword vector that portrays the user's domain interest.

7. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that: in the similarity calculation and personalized recommendation module, the domain relevance is computed by the following formula:

Relevance_T = V_{combine} \cdot W \cdot V_{IDF} = \sum_{t_i \in L} v_{t_i} \, w_{t_i} \, IDF(t_i)

where V_combine is the keyword vector of the user's domain interest; W is the word-frequency vector of the microblog T to be recommended; V_IDF is the word IDF vector; L is the set of dimensions (words) of the vector space obtained by segmenting the user's historical microblogs; v_{t_i} and w_{t_i} are the components of V_combine and W corresponding to t_i, respectively; and IDF(t_i) is the IDF value of t_i.

8. A domain information recommendation method based on a microblog platform, characterized in that it is divided into six steps, namely data acquisition and preprocessing, domain keyword extraction, user-defined keyword expansion, linear merging, similarity calculation and personalized recommendation, and topic acquisition, implemented as follows:

(1) the user's relevant microblog information is obtained and the data is preprocessed; preprocessing comprises filtering stop words by pattern matching, and performing word segmentation and part-of-speech tagging using the ICTCLAS5.0 word segmentation system; the preprocessing result is the user's historical microblog data;

(2) domain keyword extraction is performed on the preprocessing result, without supervision, using the TextRank for Weibo algorithm that the present invention derives from the TextRank algorithm; TextRank for Weibo is divided into two stages, construction of an undirected graph based on co-occurrence relations and graph-based computation of node weights; in the construction stage, the segmented words appearing in the user's historical microblogs are first converted into corresponding nodes; then, for the segmentation result of each microblog, edges are constructed for all word pairs that occur in it, the weight of an edge being the number of co-occurrences of the pair in the historical microblogs;

in the graph-based node weight computation stage, the weight of each node is computed iteratively following the idea of the PageRank algorithm until the change in node weights converges below a certain threshold;

(3) if the user has defined domain-interest keywords, the preprocessing result is additionally passed to the keyword expansion module; keyword expansion uses the P-IOW algorithm, an improvement of the P-IOLog algorithm, which computes the similarity between keywords from information such as keyword co-occurrence, distribution, and the attributes of the users they belong to, and takes words of high relevance as the expansion result of the target keyword; input of multiple user-defined keywords is also supported; for each user-defined keyword, the keyword expansion module linearly sums the expansion word vectors obtained to produce the final expansion vector;

(4) the two result vectors are normalized by maximum normalization and mapped into a unified value range; the two normalized vectors are then merged linearly according to user-defined weights for keyword extraction and keyword expansion; the merged result is used by the relevance recommendation module for relevance comparison, to generate a relevance score for each microblog to be recommended;

(5) each microblog to be recommended that the user subscribes to is segmented and vectorized by word frequency; then, according to the identified user interest, the dot product of the user-interest keyword vector, the word-frequency vector generated from the microblog to be recommended, and the IDF information vector is computed, so that the domain relevance is obtained by the vector space dot product method;

(6) the domain microblog text recommended to the user is collected as input, terms are clustered based on LDA to complete topic discovery, the relevance between each topic term set and the user's domain-interest keywords obtained by the linear merging module is then computed to establish topic importance, and topics are recommended to the user according to topic importance.