CN106776881A - Domain information recommendation system and method based on a microblog platform - Google Patents

Info

Publication number
CN106776881A
Authority
CN
China
Prior art keywords
user
keyword
microblogging
module
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611075431.XA
Other languages
Chinese (zh)
Inventor
杨燕
王帅
徐良
徐罡
田申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN201611075431.XA
Publication of CN106776881A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Abstract

The invention discloses a domain information recommendation system and method based on a microblog platform, comprising: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module. Designed around the characteristics of microblog platforms, the invention implements a domain information recommendation method that seamlessly combines keyword extraction with keyword expansion, ensuring both that domain features are extracted and that the recommendation results remain dynamic. Experiments with a corresponding system built on Sina Weibo demonstrate the effectiveness of the method. The invention can assist enterprise microblog marketing and effectively improve its efficiency.

Description

A domain information recommendation system and method based on a microblog platform
Technical field
The present invention relates to a domain microblog recommendation system and method for microblog platforms. It supports unsupervised domain feature extraction and user-defined keywords, and belongs to the field of computer technology.
Background
As the Internet enters the post-Web 2.0 era, social functionality has become the model for the Internet's great transformation. Major social networking sites have sprung up like mushrooms after rain and rapidly come to dominate the Internet. As early as March 2010, the well-known American social networking site Facebook™ surpassed Google in traffic to become the most visited website in the United States. In China, emerging social media such as Sina Weibo have also risen rapidly: as of May 16, 2012, Sina Weibo had reached 300 million users, second in user scale in the domestic Internet market only to Tencent QQ, a product built up over more than a decade. In addition, in the report on the top ten strategic technologies for the IT industry in 2011 issued by Gartner, the world's most authoritative IT research and advisory firm, technologies directly related to social networking took two places: Social Communications and Collaboration, and Social Analytics. It was the only technology category to hold two places among all the technologies listed. This shows the attention people pay to social products and the prospects of such products.
On the other hand, the spread of the Internet and users' growing attachment to it have made big data a focus of attention in the IT community in recent years. Turning big data into something of value to people, however, requires the support of related technologies such as data mining. Interest in data mining and analysis has therefore also soared in recent years, especially enterprise-oriented data mining and analysis, because it can bring direct benefits to enterprises. Analysis results based on real data generated by large numbers of users are more reliable and convincing than those of traditional analysis techniques, and because this is a real-time analysis method, it adapts better to market changes and captures market opportunities better.
Although microblog platforms contain information on all kinds of domains and react quickly to domain events, obtaining reasonably comprehensive domain information from them still faces many difficulties. The rise of microblog platforms and the rapid growth in users have brought an information overload problem: as the number of accounts a user follows increases, more and more content unrelated to the domain appears in the microblogs the user subscribes to. Conversely, if the user follows only a small number of highly domain-relevant accounts, comprehensive domain information cannot be obtained in time. In other words, a user's data acquisition on a microblog platform cannot achieve both precision and recall. Microblog platforms are also characterized by scattered topics and fragmented information, which calls for a method that can identify the domain interests of enterprise microblog users and extract and recommend microblog information according to domain relevance. However, for domain information extraction, most existing social media management and analysis software simply performs word matching against user-defined keywords, and this approach has serious defects. First, individual keywords cannot fully characterize the demand for domain information; second, the richness of language limits the effectiveness of simple character matching.
The traditional methods of modeling user interest through keyword extraction and of representing user interest through user-defined keywords each have their own defects.
The defects of keyword extraction are mainly the following: it is an unsupervised extraction algorithm whose results the user cannot adjust dynamically, so it cannot meet a user's temporary, dynamic demand for a particular domain topic. The algorithm also cannot model domain interest when the user has few historical microblogs, the so-called cold-start phenomenon.
The defects of user-defined keywords are mainly the following: first, user-defined keywords can hardly be guaranteed to cover all the information of the domain; second, in short microblog texts, judging similarity solely by whether a keyword appears means that much information with the same meaning but different wording cannot be extracted.
On the other hand, related research work also has its own problems when applied to this scenario. The rise of social networks has given data mining and data analysis technologies a valuable new application scenario and a new perspective, but it has also confronted traditional data mining techniques with new challenges.
In summary, applying the above methods in a practical microblog platform system raises the following major problems:
(1) Keyword extraction and keyword expansion each have their own problems, and a single strategy cannot adequately meet the demand for recommending domain-related information on a microblog platform.
(2) In related research work, because information on social platforms is fragmented, one class of algorithms needs an external corpus for training. This greatly reduces the practicality and portability of the algorithm.
(3) Another class of algorithms builds its model on a global corpus; such algorithms are computationally very expensive and are not generally applicable.
A recommendation method is therefore needed that can provide personalized, keyword-based recommendation of domain information to help users obtain domain-related information quickly and accurately. The method must resolve the conflict between precision and recall that enterprise users face when obtaining domain-related information on a microblog platform. For practicality, the algorithm must also not depend on an external corpus and must compute on the user's local data. Proposing a recommendation method with these properties is the focus of the present invention.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art and provide a domain information recommendation system and method based on a microblog platform. Based on the user's historical microblogs, the invention proposes modeling user interest by combining keyword extraction with keyword expansion, which both ensures comprehensive identification of domain information and lets users dynamically adjust their domain interests on demand. It uses the graph-based keyword extraction algorithm TextRank, which does not depend on other corpora and prevents the extraction results from being affected by the Zipf's law phenomenon present in language models, and it proposes an optimized P-IOW algorithm to achieve better keyword expansion. The method ensures that the user's dynamic interest demands can be met in real time and greatly enhances the expressive power of user-defined keywords. By linearly merging the results of keyword extraction and keyword expansion according to user-defined weights, it can recommend relevant domain microblog information and topics personalized to the user, helping the user obtain domain-related information quickly and accurately.
The technical solution of the invention: a domain information recommendation system based on a microblog platform, comprising: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module; wherein:
Data acquisition and preprocessing module: obtains the user's relevant microblog data and preprocesses it; preprocessing includes stop-word filtering, word segmentation and part-of-speech tagging; the preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module;
Domain keyword extraction module: based on the preprocessing result, performs unsupervised keyword extraction using the TextRank for Weibo algorithm, a modification of the TextRank algorithm; the algorithm comprises two stages, construction of an undirected graph based on co-occurrence relations and graph-based node weight calculation; in the graph construction stage, the words occurring in the user's historical microblogs are first converted into corresponding nodes; when edges between nodes are constructed, whether an edge exists between two nodes and the weight of that edge are determined by the number of times the two words co-occur in the same microblog: each time two words co-occur in one of the user's microblogs, the weight of the edge between their nodes is increased by 1, so the final edge weight is the number of co-occurrences of the two words in the microblogs; in the graph-based node weight calculation stage, the node weights are computed iteratively, round by round, until the change in node weights converges below a certain threshold; after iteration ends, the weight of each node is the importance of the word it represents, and sorting all the user's words by importance gives the keyword extraction result, thereby automatically identifying the user's domain features;
User-defined keyword expansion module: calculates the similarity between keywords based on keyword co-occurrence, keyword distribution and the attribute information of the users they belong to, and takes the highly relevant words as the expansion result for the target keyword; the module supports input of multiple user-defined keywords, and the expansion word vectors produced for the individual user-defined keywords are linearly summed to obtain the final expansion vector; the user-defined keyword expansion function ensures that the user's dynamic interest demands can be met in real time while greatly enhancing the expressive power of user-defined keywords;
Linear merging module: after automatic domain keyword extraction and the expansion based on user-defined keywords are complete, normalizes the two result vectors using maximum-value normalization so that the keyword extraction and keyword expansion result vectors are mapped into a unified value range; after normalization, the two normalized vectors are linearly merged, and the merging process supports user-defined weights for keyword extraction and keyword expansion; the module outputs a word vector representing the user's final domain interest;
Relevance calculation and personalized recommendation module: after the linear merging module has produced the keyword vector characterizing the user's domain interest, performs word segmentation and word frequency counting on each microblog to be filtered to generate a word frequency vector, then takes the dot product of the user interest keyword vector, the word frequency vector generated from the microblog to be recommended and the IDF information vector to obtain the relevance between the microblog and the user's interest; this relevance is the microblog's domain relevance. By calculating the domain relevance of each of the user's microblogs and sorting them from high to low by domain relevance, microblog information is presented to the user, realizing personalized domain microblog recommendation for the user;
Topic acquisition module: trains an LDA model with the domain microblog texts recommended to the user as input, and clusters terms into topics according to each topic's term distribution; the relevance between each topic's term set and the user domain interest keywords obtained in the linear merging module is calculated to obtain topic importance, and topics are presented to the user ranked by importance, completing topic discovery and recommendation.
The data acquisition and preprocessing module is implemented as follows:
(1) After the user logs in to the microblog system, user verification is performed first; once verification passes, the microblog platform credential associated with the user is automatically used to interact with the microblog platform in order to verify the legitimacy of the user's identity on the platform;
(2) The user's relevant microblog data is obtained, and a local database is used to persist the obtained data in structured form so that it can be read at any time;
(3) The persisted microblog texts are preprocessed in three parts: stop-word filtering, word segmentation and part-of-speech tagging; in view of the characteristics of microblog text, stop words are first filtered out by pattern matching, and then Chinese word segmentation and part-of-speech tagging optimized for the microblog scenario are performed using the word segmenter ICTCLAS5.0; before keyword extraction and keyword expansion, the segmented user microblogs are also filtered by part of speech so that only nouns are retained. The preprocessing result is the user's historical microblog data.
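The stop-word and part-of-speech filtering in step (3) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the regular expressions are assumptions based on the stop-word forms the embodiment lists (topic tags, mentions, emoticons, URL links), and `keep_nouns` assumes a segmenter that returns `(word, tag)` pairs whose noun tags start with `n`. The function names are hypothetical.

```python
import re

# Hypothetical patterns for the microblog stop-word forms; these are
# assumptions for illustration, not rules taken from the patent.
STOP_PATTERNS = [
    re.compile(r"https?://\S+"),   # URL links
    re.compile(r"#[^#]+#"),        # "#topic#" tags
    re.compile(r"@[\w\-]+"),       # "@username" mentions (may over-match when
                                   # a mention is not followed by a space)
    re.compile(r"\[[^\[\]]+\]"),   # "[emoticon]" markers
]

def strip_stop_patterns(text):
    """Remove every stop-word pattern and collapse the leftover whitespace."""
    for pattern in STOP_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())

def keep_nouns(tagged_words, noun_prefix="n"):
    """Keep only words whose part-of-speech tag starts with the noun prefix.

    tagged_words: iterable of (word, tag) pairs from any segmenter.
    """
    return [word for word, tag in tagged_words if tag.startswith(noun_prefix)]
```

In a full pipeline the cleaned text would be passed to the segmenter between these two steps; the segmenter call is left out here because its interface is not described in the text.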
In the domain keyword extraction module, the node weights are computed iteratively, round by round, following the idea of the PageRank algorithm, using the following formula:
where: V_i is the i-th node; TR(V_i) is the weight of node V_i; w_ij is the weight of the edge between nodes V_i and V_j; E(V_i) is the set of edges connected to V_i; d is the damping coefficient of the iteration, set to 0.85. Iteration can start from arbitrary initial values and continues until convergence; the convergence condition is that the absolute difference between the sum of all node weights in the current iteration and that in the previous iteration is less than a specified value.
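The update formula itself is not present in the text above. One form consistent with the symbols defined here is the standard weighted TextRank update, given as an assumption rather than as the patent's confirmed formula:

$$TR(V_i) = (1 - d) + d \sum_{V_j \in E(V_i)} \frac{w_{ji}}{\sum_{V_k \in E(V_j)} w_{jk}} \, TR(V_j)$$

A minimal Python sketch under that assumption follows. The function names are hypothetical, the input is assumed to be microblogs that are already segmented and filtered, and the stopping rule uses the sum of per-node absolute changes, which is a stricter reading of the convergence condition.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(microblogs):
    """Build the undirected co-occurrence graph.

    microblogs: list of word lists, one per microblog.
    Returns graph[u][v] = number of microblogs in which u and v co-occur.
    Words that never co-occur with another word do not become nodes.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for words in microblogs:
        for u, v in combinations(sorted(set(words)), 2):
            graph[u][v] += 1.0
            graph[v][u] += 1.0
    return graph

def textrank(graph, d=0.85, tol=1e-6, max_iter=200):
    """Iterate the weighted TextRank update until the total change is below tol."""
    scores = {node: 1.0 for node in graph}
    # Total edge weight attached to each node (the denominator of the update).
    strength = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node, nbrs in graph.items():
            rank = sum(w / strength[nbr] * scores[nbr] for nbr, w in nbrs.items())
            new_scores[node] = (1 - d) + d * rank
        change = sum(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if change < tol:
            break
    # Highest-weight words first: this ordering is the keyword extraction result.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Counting each word pair once per microblog matches the rule that an edge weight grows by 1 for every microblog in which the two words co-occur.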
In the user-defined keyword expansion module, the similarity between keywords is calculated using the improved P-IOW algorithm, implemented as follows:
For a given user-defined keyword s, the domain relevance of a word t with respect to s is calculated as follows:
where:
where: s is the user-defined keyword; wf(t) is the number of microblogs containing word t; t_M is the word contained in the most microblogs in the domain-related corpus, and wf(t_M) is the number of microblogs containing t_M; a further count gives the number of microblogs that do not contain word s; wf(t∧s) is the number of microblogs containing both word t and word s; N is the total number of the user's microblogs; s_p is a smoothing factor with value range (0, 1) that prevents a zero divisor when the corresponding count is zero. It follows from the formula that, before formula (1) is multiplied by the down-weighting factor, the value of P-IOLogW lies between log(s_p) and log(1/s_p). The larger the resulting P-IOW value, the closer the word is in domain to the user-defined keyword s. The user-defined keyword itself is assigned the upper bound of the P-IOLogW range, that is, log(1/s_p).
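The P-IOW formula itself is not reproduced in the text, so no attempt is made to compute the score here. The sketch below only gathers the counts that the definitions above refer to, which any implementation of the formula would need as inputs. The function name and dictionary keys are hypothetical, and the input is assumed to be segmented microblogs.

```python
from collections import Counter

def cooccurrence_counts(microblogs, s):
    """Collect the microblog counts named in the P-IOW definitions.

    microblogs: list of word lists, one per microblog.
    s: the user-defined keyword.
    """
    n = len(microblogs)
    wf = Counter()          # wf(t): microblogs containing t
    wf_with_s = Counter()   # wf(t AND s): microblogs containing both t and s
    for words in microblogs:
        unique = set(words)
        wf.update(unique)
        if s in unique:
            wf_with_s.update(unique - {s})
    # t_M: the word contained in the most microblogs, with its count.
    t_max, wf_t_max = wf.most_common(1)[0] if wf else (None, 0)
    return {
        "N": n,
        "wf": wf,
        "wf_with_s": wf_with_s,
        "wf_not_s": n - wf[s],   # microblogs that do not contain s
        "t_M": t_max,
        "wf_t_M": wf_t_max,
    }
```

Each word is counted at most once per microblog, since the definitions count microblogs rather than word occurrences.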
In the user-defined keyword expansion module, the expansion word vectors produced by keyword expansion are linearly summed to obtain the final expansion vector. The expansion procedure is as follows:
(1) For each keyword in the user-defined keyword set, first calculate the weights of all of its expansion words based on P-IOW;
(2) Map the expansion word weight vector associated with the keyword onto the word space of the domain-related corpus to form an expansion word vector;
(3) Linearly superpose all the expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.
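Steps (2) and (3) above can be sketched as follows, assuming the per-keyword expansion weights from step (1) are already available as dictionaries. The function name and the data layout are assumptions made for illustration.

```python
def expansion_vector(keyword_weights, vocabulary):
    """Superpose the expansion word vectors of all user-defined keywords.

    keyword_weights: {keyword: {expansion_word: weight}}, for example
        P-IOW scores computed per keyword (assumed to be given).
    vocabulary: ordered list of words defining the corpus word space.
    """
    index = {word: i for i, word in enumerate(vocabulary)}
    total = [0.0] * len(vocabulary)
    for weights in keyword_weights.values():
        # Step (2): map this keyword's weights onto the word space.
        vec = [0.0] * len(vocabulary)
        for word, weight in weights.items():
            if word in index:          # words outside the space are ignored
                vec[index[word]] = weight
        # Step (3): linear superposition.
        total = [a + b for a, b in zip(total, vec)]
    return total
```

For example, with the vocabulary `["phone", "camera", "battery"]`, two keywords that both expand to "phone" contribute the sum of their two weights to the first component.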
The linear merging module is implemented as follows:
(1) The vectors generated by keyword extraction and keyword expansion are each normalized using maximum-value normalization;
(2) For each component of a vector, the normalization is:
$v_{\mathrm{normal}} = v / v_{\max}$
where: v is the initial value of a vector component and v_max is the maximum over all components of the vector. After maximum-value normalization, all components of the vector lie in (0, 1] and are all non-zero. The vectors are then merged by linear weighting:
$V_{\mathrm{combine}} = r \times V_{\mathrm{kw\text{-}extract}} + (1 - r) \times V_{\mathrm{kw\text{-}expand}}$
where: r is the user-defined merging weight ratio, V_kw-extract is the result vector of keyword extraction, V_kw-expand is the result vector generated by keyword expansion, and the merged result V_combine is the keyword vector characterizing the user's domain interest.
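The two formulas above translate directly into a short sketch. The function names are hypothetical, and the sketch assumes both input vectors are indexed over the same word space and contain non-negative values.

```python
def max_normalize(vec):
    """Divide every component by the largest one: v_normal = v / v_max."""
    v_max = max(vec, default=0.0)
    if v_max <= 0:
        return list(vec)   # nothing to scale; avoids division by zero
    return [v / v_max for v in vec]

def linear_merge(v_extract, v_expand, r):
    """V_combine = r * V_kw-extract + (1 - r) * V_kw-expand after normalization."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    if len(v_extract) != len(v_expand):
        raise ValueError("vectors must share the same word space")
    a = max_normalize(v_extract)
    b = max_normalize(v_expand)
    return [r * x + (1 - r) * y for x, y in zip(a, b)]
```

Normalizing before merging matters because TextRank weights and P-IOW scores live on different scales; without it, r would not reflect the balance the user intends.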
In the similarity calculation and personalized recommendation module, the domain relevance is calculated with the following formula:
where: V_combine is the keyword vector of the user's domain interest; W is the word frequency vector of the microblog T to be recommended; V_IDF is the IDF vector of the words; L is the total number of dimensions of the vector space after segmentation of the user's historical microblogs; the two component symbols denote the components of V_combine and W corresponding to t_i, respectively; IDF(t_i) is the IDF value of t_i.
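The formula is not present in the text above. Given the definitions and the description of the calculation as a dot product of the three vectors, a consistent form (an assumption, not the patent's confirmed formula) is:

$$\mathrm{Relevance}(T) = \sum_{i=1}^{L} V_{\mathrm{combine}}(t_i) \times W(t_i) \times IDF(t_i)$$

A minimal sketch under that assumption, using sparse dictionaries instead of dense vectors. The function names are hypothetical, and the IDF values are assumed to be supplied, since the text does not say how they are computed.

```python
from collections import Counter

def domain_relevance(interest, tf, idf):
    """Sum interest weight * word frequency * IDF over the words of one microblog.

    interest: {word: weight}, the merged keyword vector V_combine.
    tf: {word: count}, the word frequency vector W of the microblog.
    idf: {word: idf value}; words missing from either mapping contribute 0.
    """
    return sum(interest.get(t, 0.0) * count * idf.get(t, 0.0)
               for t, count in tf.items())

def rank_microblogs(microblogs, interest, idf):
    """Score each segmented microblog and sort from most to least relevant."""
    scored = [(domain_relevance(interest, Counter(words), idf), words)
              for words in microblogs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Only words that appear in both the microblog and the interest vector contribute, so a microblog with no domain keywords scores zero and sinks to the bottom of the ranking.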
A domain information recommendation method based on a microblog platform, divided into six steps: data acquisition and preprocessing, domain keyword extraction, user-defined keyword expansion, linear merging, similarity calculation and personalized recommendation, and topic acquisition, implemented as follows:
(1) Obtain the user's relevant microblog information and preprocess the data; preprocessing includes filtering stop words by pattern matching and performing word segmentation and part-of-speech tagging with the word segmentation system ICTCLAS5.0; the preprocessing result is the user's historical microblog data.
(2) Perform domain keyword extraction on the preprocessing result; keyword extraction is carried out without supervision using the TextRank for Weibo algorithm of the invention, a modification of the TextRank algorithm; the process is divided into two stages, construction of an undirected graph based on co-occurrence relations and graph-based node weight calculation; the graph construction stage first converts the words occurring in the user's historical microblogs into corresponding nodes; then, for each word pair occurring in the segmentation result of each microblog, an edge is constructed whose weight equals the pair's number of co-occurrences in the historical microblogs;
The graph-based node weight calculation stage iteratively computes the node weights, round by round, following the idea of the PageRank algorithm until the change in node weights converges below a certain threshold;
(3) If the user has defined domain interest keywords, the preprocessing result is additionally passed to the keyword expansion module; keyword expansion is carried out using the P-IOW algorithm, an improvement on the P-IOLog algorithm, which calculates the similarity between keywords based on information such as keyword co-occurrence, keyword distribution and the attributes of the users they belong to, and takes the highly relevant words as the expansion result for the target keyword; input of multiple user-defined keywords is supported; for each user-defined keyword, the keyword expansion module linearly sums the expansion word vectors it produces to obtain the final expansion vector;
(4) The two result vectors are normalized using maximum-value normalization and mapped into a unified value range; after normalization, the two normalized vectors are linearly merged, with the merging process supporting user-defined weights for keyword extraction and keyword expansion; the merged result is used by the relevance recommendation module for relevance comparison, in order to generate relevance scores for the microblogs to be recommended;
(5) The microblogs to be recommended that the user subscribes to are segmented and vectorized by word frequency; then, based on the identified user interest, the dot product of the user interest keyword vector, the word frequency vector generated from the microblog to be recommended and the IDF information vector is taken, so that domain relevance is calculated as a vector-space dot product;
(6) The set of domain microblog texts recommended to the user is taken as input, and term clustering based on the LDA strategy is performed to complete topic discovery; the relevance between each topic's term set and the user domain interest keywords obtained in the linear merging module is then calculated to establish topic importance, and topics are recommended to the user according to topic importance.
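The topic importance calculation in step (6) can be sketched as follows. LDA training itself is not shown: the sketch assumes each topic's term distribution has already been produced by some LDA implementation. The text only says that relevance between topic terms and interest keywords is calculated, so the probability-weighted sum used here is an assumption, and the function names are hypothetical.

```python
def topic_importance(topic_terms, interest):
    """Weight each topic term's probability by its interest-vector weight.

    topic_terms: {term: probability} for one topic, e.g. its top LDA terms.
    interest: {word: weight}, the merged user domain interest keyword vector.
    """
    return sum(p * interest.get(term, 0.0) for term, p in topic_terms.items())

def rank_topics(topics, interest):
    """Return topic names ordered from most to least important."""
    scored = [(topic_importance(terms, interest), name)
              for name, terms in topics.items()]
    return [name for _, name in sorted(scored, key=lambda x: x[0], reverse=True)]
```

A topic dominated by terms outside the user's interest vector scores low, which is how topics irrelevant to the domain would be pushed down the list.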
Compared with the prior art, the invention has the following advantages:
(1) Regarding the source of the domain corpus: keyword extraction and keyword expansion each have their own problems, and a single strategy cannot adequately meet the demand for recommending domain-related information on a microblog platform. To address this, the invention models the user's domain interest from the domain corpus by fusing graph-based keyword extraction with user-defined keyword expansion based on co-occurrence information. This ensures that the user's dynamic interest demands can be met in real time and greatly enhances the expressive power of user-defined keywords, so that comprehensive characterization of domain information and dynamic changes in user interest are both accommodated, resolving the conflict between precision and recall that domain users face when obtaining domain-related information on a microblog platform.
(2) For practicality, the algorithms in the invention do not depend on an external corpus and do not require analysis of a large-scale corpus; they compute on the user's local data, that is, the user's historical microblog data. The algorithms therefore respond quickly and are more practical and portable.
(3) Topic modeling is performed again on the domain microblog texts recommended to the user, filtering out interference from domain-irrelevant topics and achieving more accurate topic recommendation, which further improves the accuracy of domain information recommendation.
Brief description of the drawings
Fig. 1 is the architecture diagram of the invention;
Fig. 2 is the overall system flow chart of the invention;
Fig. 3 is the system sequence diagram of the invention;
Fig. 4 is the system framework diagram of the invention;
Fig. 5 is an excerpt of the main parts of the UML diagram of the graph-based keyword extraction submodule of the invention;
Fig. 6 is an excerpt of the main parts of the UML diagram of the keyword expansion submodule of the invention, based on co-occurrence information.
Detailed description of embodiments
The invention is described in detail below with reference to specific embodiments and the accompanying drawings.
As shown in Fig. 1, the domain information recommendation system based on a microblog platform of the invention comprises: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module.
Data acquisition and preprocessing module: for enterprise microblog users on the Sina Weibo platform, the historical microblog texts of relevant microblog accounts in certain domains specified by the user are analyzed. The user's relevant historical microblog information is obtained through the Sina Weibo open platform and the data is preprocessed, that is, stop words are filtered out by pattern matching, and word segmentation and part-of-speech tagging are performed with the word segmentation system ICTCLAS5.0. The preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module.
Domain keyword extraction module: based on the preprocessing result, an undirected graph based on co-occurrence relations is constructed first: the words occurring in the user's historical microblogs are converted into corresponding nodes, and for each word pair occurring in the segmentation result of each microblog an edge is constructed whose weight equals the pair's number of co-occurrences in the historical microblogs. Graph-based node weight calculation is then performed: following the idea of the PageRank algorithm, the node weights are computed iteratively, round by round, until the change in node weights converges below a certain threshold. After iteration ends, the weight of each node is the importance of the word it represents. Sorting all the user's words by importance gives the keyword extraction result, thereby automatically identifying the user's domain features.
User-defined keyword expansion module: uses the improved P-IOW (Probabilistic Inside-Outside Log for Weibo) method to calculate the similarity between keywords based on information such as keyword co-occurrence, keyword distribution and the attributes of the users they belong to, and takes the highly relevant words as the expansion result for the target keyword. The module supports input of multiple user-defined keywords. For each user-defined keyword, the keyword expansion module linearly sums the expansion word vectors it produces to obtain the final expansion vector. The user-defined keyword expansion function ensures that the user's dynamic interest demands can be met in real time while greatly enhancing the expressive power of user-defined keywords.
Linear merging module: after automatic domain keyword extraction and the expansion based on user-defined keywords are complete, the two result vectors are normalized using maximum-value normalization and mapped into a unified value range. After normalization, the two normalized vectors are linearly merged, with the merging process supporting user-defined weights for keyword extraction and keyword expansion. The module outputs a word vector representing the user's final domain interest.
Relevance calculation and personalized recommendation module: after the keyword vector characterizing the user's domain interest has been generated, the module performs word segmentation and word frequency counting on each microblog to be filtered to generate a word frequency vector, then takes the dot product of the user interest keyword vector, the word frequency vector generated from the microblog to be recommended and the IDF information vector to obtain the relevance between the microblog and the user's interest. By calculating the domain relevance of each of the user's microblogs and sorting them from high to low by domain relevance, personalized domain microblog recommendation is realized for the user.
Topic acquisition module: trains an LDA model with the domain microblog texts recommended to the user as input, and clusters terms into topics according to each topic's term distribution; the relevance between each topic's term set and the user domain interest keywords obtained in the linear merging module is calculated to obtain topic importance, and topics are presented to the user ranked by importance, completing topic discovery and recommendation.
As shown in Fig. 2, the data acquisition and preprocessing module is implemented as follows:
(1) After the user logs in to the system, the user is first authenticated. Once authentication succeeds, the system automatically uses the Sina Weibo open platform OAuth 2.0 credential associated with the user to interact with the open platform and verify the legitimacy of the user's identity on Sina Weibo. If the credential has not expired, user verification is complete. If the credential does not exist or has expired, the system redirects automatically to the open platform's OAuth 2.0 verification page, which asks the user for their Sina Weibo user name and password. After the user enters the correct information, the open platform returns the refreshed credential to the system. The system persists the credential so that, within its validity period, the user needs only the system's own user name and password to access the open platform, obtain microblog data, and perform related operations.
(2) The microblog text obtained is persisted in structured form in a local database so that the upper-layer analysis modules can read it at any time. For data updates, the module supports incremental updating: each update transfers only the microblog information newly added for the user, which improves system response time and minimizes network bandwidth use.
(3) The persisted microblog text is then preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging. Microblog stop words mainly take the following forms: topic tags of the form "#topic#", directed mentions of a user of the form "@username", emoticons of the form "[emoticon word]", and URL links contained in the microblog. Given these characteristics of microblog text, the stop words are first filtered out by pattern matching. Chinese word segmentation and part-of-speech tagging, optimized for the microblog scenario, are then performed. Word segmentation is a text-processing step specific to a small number of languages such as Chinese, because Chinese text, unlike most other languages, has no explicit word separators. In natural language processing, segmentation is an unavoidable step in converting text into a form the computer can work with. It is all the more indispensable here because the invention uses a vector space model, and the quality of the segmentation result directly affects the quality of the algorithm's output. Part-of-speech tagging assigns the correct lexical tag to each word in a given sentence and is a very useful preprocessing step for subsequent natural language processing. This module uses the ICTCLAS5.0 segmenter for segmentation and part-of-speech tagging; because the imported new-word corpus also carries part-of-speech information, importing it does not affect tagging accuracy. A survey of domain-related microblogs on the Sina Weibo platform found that nouns describe a user's domain interest accurately, whereas words of other parts of speech, when used as keywords, tend to introduce ambiguity and lower recommendation accuracy. Therefore, before keyword extraction and keyword expansion, the segmented microblog text is filtered by part of speech and only nouns are retained. The preprocessing result is the user's historical microblog data.
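By way of non-limiting illustration, the pattern-matching filter and the noun-only part-of-speech filter of step (3) can be sketched in Python as follows. The regular expressions, the function names, and the convention that noun tags begin with "n" are assumptions made for this sketch; segmentation and tagging themselves are performed by an external tool such as ICTCLAS5.0 and are not invoked here.

```python
import re

# Illustrative patterns for the four kinds of microblog stop words named above.
# URLs are removed first so that a "#" inside a link is not read as a topic tag.
NOISE_PATTERNS = [
    re.compile(r"https?://[A-Za-z0-9./?=&%_#:\-]+"),  # URL links
    re.compile(r"#[^#]+#"),                           # "#topic#" tags
    re.compile(r"@[\w\-]+"),                          # "@username" mentions
    re.compile(r"\[[^\[\]]+\]"),                      # "[emoticon]" markers
]


def strip_weibo_noise(text):
    """Remove topic tags, mentions, emoticon markers and URLs from one microblog."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_nouns(tagged_words):
    """Keep only nouns from (word, tag) pairs produced by an external tagger.

    Assumption of this sketch: noun tags start with "n" (e.g. "n", "nr", "ns").
    """
    return [word for word, tag in tagged_words if tag.startswith("n")]
```

In such a sketch the noise filter would run before segmentation and the noun filter after tagging, matching the order given in step (3).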
As shown in Fig. 2, the domain keyword extraction module is implemented as follows:
(1) For keyword extraction, the invention selected TextRank, currently a relatively advanced algorithm, on the basis of experimental comparison. The reasons are as follows. TextRank overcomes the defects of the TF-IDF method: it does not need TF information, so microblogs do not have to be merged, and it does not depend on external IDF information. TextRank builds in the assumption that words co-occurring with highly domain-relevant keywords are themselves highly domain-relevant, so keyword weights are no longer computed linearly, which to some extent overcomes the power-law distribution problem of the TF-IDF algorithm. It is therefore better suited to this application scenario than the other algorithms.
(2) This module draws on the ideas of TextRank keyword extraction and, taking the characteristics of the microblog scenario into account, proposes the TextRank for Weibo algorithm. TextRank for Weibo is a graph-based keyword extraction algorithm inspired by PageRank. For graph construction, the traditional TextRank algorithm defines the weight of the edge between two words as their co-occurrence count within a fixed-length sliding window over a document. Because microblogs are short texts, the invention instead uses the co-occurrence of two words within one microblog as the basis for the edge weight. The PageRank algorithm is then run on this undirected graph to compute each word's weight as a keyword. Each node in the graph represents one segmented word. If two words co-occur in one of the user's microblogs, the weight of the edge between their nodes is increased by 1, so the final edge weight is the co-occurrence count of the two words across the microblogs.
Once the undirected graph is determined, the weight of each node is produced iteratively by an algorithm similar to PageRank. The weight of node $V_i$ is updated according to the following formula:
$$TR(V_i) = (1-d) + d \times \sum_{V_j \in E(V_i)} \frac{w_{ij}}{\sum_{V_k \in E(V_j)} w_{jk}} \, TR(V_j)$$
where $V_i$ is the i-th node, $TR(V_i)$ is the weight of node $V_i$, and $w_{ij}$ is the weight of the edge between nodes $V_i$ and $V_j$; $E(V_i)$ is the set of nodes connected to $V_i$ by an edge; $d$ is the damping coefficient of the iteration, set to 0.85. Iteration can start from arbitrary initial values and continues until convergence; the convergence condition is that the sum over all nodes of the absolute difference between a node's weight in the current iteration and in the previous iteration is less than a specified value.
Fig. 5 shows the main part of the module's UML class diagram. The module is divided into two stages: construction of the undirected graph based on co-occurrence relations, and graph-based computation of node weights.
In the graph construction stage, each segmented word that occurs in the user's historical microblogs is first converted into a corresponding node. Then, for the segmentation result of each microblog, an edge is constructed for every pair of words that occurs in it; the weight of the edge is the number of historical microblogs in which that pair co-occurs.
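By way of non-limiting illustration, the two stages can be sketched in Python as follows. The function names, the tolerance, and the iteration cap are assumptions of the sketch. Convergence is tested on the sum over all nodes of the absolute weight change between two consecutive iterations, which is one reading of the convergence condition.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(posts):
    """posts: list of token lists, one per microblog. Returns {u: {v: weight}}."""
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in posts:
        words = sorted(set(tokens))
        for word in words:
            graph[word]  # every segmented word becomes a node, even if isolated
        for u, v in combinations(words, 2):
            graph[u][v] += 1  # one more microblog in which u and v co-occur
            graph[v][u] += 1
    return graph


def textrank(graph, d=0.85, tol=1e-6, max_iter=200):
    """Iterate TR(Vi) = (1 - d) + d * sum_j (w_ij / sum_k w_jk) * TR(Vj)."""
    scores = {node: 1.0 for node in graph}
    total_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for vi, nbrs in graph.items():
            rank = sum(w / total_weight[vj] * scores[vj] for vj, w in nbrs.items())
            new_scores[vi] = (1 - d) + d * rank
        delta = sum(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores
```

Sorting the returned dictionary by value in descending order gives the keyword ranking; the number of keywords kept is left open here, as it is in the description.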
As shown in Fig. 2, the user-defined keyword expansion module is implemented as follows:
(1) Keyword extraction alone is not dynamic enough, so the invention introduces user-defined keywords to make interest modeling more dynamic. At the same time, to address the limited expressive power of user-defined keywords, the invention models user interest by combining keyword extraction with user-defined keyword expansion.
(2) The application scenario of the invention requires that, against the background of a domain-related corpus, domain keywords defined by the user be expanded on the basis of domain relevance. The invention therefore uses the co-occurrence information of words in the domain-related corpus to compute the similarity between words. The method rests on the assumption that a word co-occurring with a user-defined keyword in the same microblog has strong domain similarity with that keyword.
(3) The invention builds on P-IOLog, an expansion method for topic tags. The confidence of the expansion-word weights produced by that method is proportional to the size of the sample space, that is, to the frequency of the user-defined keyword in the corpus being analyzed; when few microblogs contain the user-defined keyword, the expansion results of P-IOLog have a large error. The invention therefore adapts P-IOLog to this scenario by introducing a down-weighting factor that grows as the number of occurrences of the user-defined keyword grows, and by dropping the original method's treatment of the topic layer. The result is the Probabilistic Inside-Outside Log for Weibo (P-IOW) method.
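(4) Specifically, for a given user-defined keyword s, the domain relevance of a word t with respect to s is computed as follows:
$$P\text{-}IOW(t,s) = \begin{cases} \dfrac{\log(wf(s)+1)}{\log(WF_{t_M}+1)} \times P\text{-}IOLogW(t,s), & t \neq s \\ P\text{-}IOLogW(t,s), & t = s \end{cases} \qquad (1)$$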
Wherein:
where s is the user-defined keyword; $wf(t)$ is the number of microblogs containing the word t; $t_M$ is the word contained in the largest number of microblogs in the domain-related corpus, and $WF_{t_M}$ is the number of microblogs containing $t_M$; the number of microblogs that do not contain the word s is also used; $wf(t \wedge s)$ is the number of microblogs containing both t and s; N is the total number of the user's microblogs; and $s_p$ is a smoothing factor with range (0, 1) that prevents division by zero when a count used as a divisor is zero. It follows from the formula that, before the down-weighting factor in (1) is applied, the range of P-IOLogW lies between $\log(s_p)$ and $\log(1/s_p)$. The larger the value of P-IOW, the higher the domain similarity between the word t and the user-defined keyword s. The user-defined keyword itself is assigned the upper bound of the P-IOLogW range, $\log(1/s_p)$.
(5) The invention also supports entry of multiple user-defined keywords. For each user-defined keyword, the keyword expansion module produces an expansion word vector, and these vectors are summed linearly to obtain the final expansion vector. The expansion procedure is as follows:
1) For each keyword in the set of user-defined keywords, first compute the weights of all its expansion words using P-IOW;
2) Map the expansion-word weight vector of the keyword onto the word space of the domain-related corpus to form an expansion word vector;
3) Linearly superpose all expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.
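By way of non-limiting illustration, steps 1) to 3) and the down-weighting factor of formula (1) can be sketched in Python as follows. The expression for P-IOLogW is not reproduced in this text, so the sketch takes it as a caller-supplied function `piolog_w(t, s)`; that parameter, the other names, and the dictionary-based vector representation are assumptions of the sketch. It also assumes that at least one word occurs in at least one microblog, so that the denominator log(WF_tM + 1) is positive.

```python
import math


def piow(t, s, wf, wf_max, s_p, piolog_w):
    """P-IOW weight of word t for user-defined keyword s, following formula (1)."""
    if t == s:
        # The keyword itself receives the upper bound of the P-IOLogW range.
        return math.log(1.0 / s_p)
    # Down-weighting factor: grows with the number of microblogs containing s.
    down_weight = math.log(wf.get(s, 0) + 1) / math.log(wf_max + 1)
    return down_weight * piolog_w(t, s)


def expand_keywords(keywords, vocabulary, wf, s_p, piolog_w):
    """Sum the per-keyword expansion vectors over the corpus vocabulary.

    wf maps each word to the number of microblogs that contain it.
    """
    wf_max = max(wf.values())  # WF_tM: count for the most widespread word
    expansion = {t: 0.0 for t in vocabulary}
    for s in keywords:
        for t in vocabulary:
            expansion[t] += piow(t, s, wf, wf_max, s_p, piolog_w)
    return expansion
```

A keyword that never occurs in the corpus has wf(s) = 0, so its down-weighting factor is zero and it contributes nothing to other words, which is consistent with the stated purpose of the factor.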
Fig. 6 shows the main part of the module's UML class diagram. The module expands user-defined keywords using the co-occurrence information between other words and the user-defined keywords, and represents the expansion result as a vector so that it can be merged seamlessly with the keyword extraction result.
As shown in Fig. 2, the linear merging module is implemented as follows:
(1) Linear merging according to user-defined weights is the process of merging the two result vectors after automatic keyword extraction and the expansion based on user-defined keywords have both completed.
The traditional methods of modeling user interest by keyword extraction and of representing user interest by user-defined keywords each have defects. For keyword extraction, the user cannot adjust the algorithm's result manually, so a temporary, dynamic need for some domain topic cannot be met; moreover, the algorithm cannot model domain interest when the user has few historical microblogs, the so-called cold-start problem. For user-defined keywords: first, it is hard to guarantee that the keywords a user defines cover all the information of the domain; second, in short microblog text, judging similarity only by whether a keyword occurs means that much information expressed with synonymous but different words cannot be extracted.
In view of the defects of these two methods, the invention proposes combining automatic domain keyword extraction with user-defined keywords, and expanding the user-defined keywords using word co-occurrence information. This approach both identifies domain information comprehensively and lets keywords be adjusted at any time as the user's interest changes; user-defined keywords can also alleviate the cold-start problem to some extent.
(2) Because the result vectors of keyword extraction and keyword expansion do not lie in the same value range, the two vectors must be mapped into the same range. The module applies maximum normalization to both result vectors, mapping them into one unified value range. The two normalized vectors are then merged linearly, with user-defined weights for the keyword-extraction and keyword-expansion results. The module outputs one word vector representing the user's final domain interest.
For each component of a vector, the normalization is:
$$v_{normal} = v / v_{max}$$
where $v$ is the original value of a component of the vector and $v_{max}$ is the largest of all components of the vector. After maximum normalization, all components of the vector lie in (0, 1] and all are non-zero. The vectors are then merged by linear weighting:
$$V_{combine} = r \times V_{kw\text{-}extract} + (1-r) \times V_{kw\text{-}expand}$$
where $r$ is the user-defined merge weight, $V_{kw\text{-}extract}$ is the result vector of keyword extraction, $V_{kw\text{-}expand}$ is the result vector produced by keyword expansion, and the merged result $V_{combine}$ is the keyword vector describing the user's domain interest.
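By way of non-limiting illustration, maximum normalization and the weighted merge can be sketched in Python as follows. Vectors are represented as word-to-weight dictionaries, and a word missing from one vector is treated as having weight 0 there; these choices, the function names, and the handling of an empty vector are assumptions of the sketch. The largest component of a non-empty vector is assumed to be positive.

```python
def max_normalize(vec):
    """Divide every component by the largest component of the vector."""
    if not vec:
        return {}  # e.g. no user-defined keywords were supplied
    vmax = max(vec.values())
    return {term: value / vmax for term, value in vec.items()}


def linear_merge(v_extract, v_expand, r):
    """Return r * normalized(v_extract) + (1 - r) * normalized(v_expand)."""
    a = max_normalize(v_extract)
    b = max_normalize(v_expand)
    return {
        term: r * a.get(term, 0.0) + (1 - r) * b.get(term, 0.0)
        for term in set(a) | set(b)
    }
```

For example, with r = 0.25, an extraction vector {"a": 2, "b": 1} and an expansion vector {"b": 4, "c": 2}, word "b" receives 0.25 × 0.5 + 0.75 × 1.0 = 0.875.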
As shown in Fig. 2, the similarity computation and personalized recommendation module is implemented as follows:
(1) After the keyword vector $V_{combine}$ describing the user's domain interest has been generated, the relevance module performs word segmentation and word-frequency counting on each microblog to be filtered to produce a word-frequency vector, and then takes the dot product of the user-interest keyword vector, the word-frequency vector of the candidate microblog, and the IDF vector to obtain the relevance between the microblog and the user's interest. Relevance is computed as follows:
$$Relevance_T = V_{combine} \cdot W \cdot V_{IDF} = \sum_{t_i \in L} v_{t_i} \, w_{t_i} \, IDF(t_i)$$
where $V_{combine}$ is the keyword vector of the user's domain interest; $W$ is the word-frequency vector of the candidate microblog T; $V_{IDF}$ is the IDF vector of the words; $L$ is the set of dimensions of the vector space obtained by segmenting the user's historical microblogs; $v_{t_i}$ and $w_{t_i}$ are the components of $V_{combine}$ and $W$ corresponding to $t_i$; and $IDF(t_i)$ is the IDF value of $t_i$.
(2) Finally, the model computes the domain relevance of each of the user's microblogs, and the system recommends microblogs according to it. The form of recommendation is not restricted: microblogs can be sorted from high to low domain relevance, or only the microblogs whose domain relevance exceeds a specified threshold can be selected, and so on. The invention uses the first presentation method.
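By way of non-limiting illustration, the relevance computation of step (1) and both presentation options of step (2) can be sketched in Python as follows. The IDF values are taken as given input, since their derivation is not detailed here; the function names and the dictionary-based representation are assumptions of the sketch.

```python
from collections import Counter


def relevance(interest, tokens, idf):
    """Sum over words of interest weight * word frequency * IDF value."""
    tf = Counter(tokens)
    return sum(
        interest.get(term, 0.0) * count * idf.get(term, 0.0)
        for term, count in tf.items()
    )


def recommend(posts, interest, idf, threshold=None):
    """posts: {post_id: token list}. Returns [(post_id, score)], high to low.

    If a threshold is given, only posts scoring above it are kept.
    """
    scored = [(pid, relevance(interest, toks, idf)) for pid, toks in posts.items()]
    if threshold is not None:
        scored = [item for item in scored if item[1] > threshold]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `recommend` without a threshold corresponds to the first presentation method, which is the one adopted by the invention.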
As shown in Fig. 2, the topic acquisition module is implemented as follows:
(1) The recommended domain microblogs are preprocessed. Preprocessing is the same as in module 1 and comprises stop-word filtering, word segmentation, and part-of-speech tagging.
(2) An LDA model is trained with the processed data as input, and terms are clustered into topics according to each topic's term distribution;
(3) The relevance between each topic's term set and the user's domain-interest keywords obtained from the linear merging module is computed. The underlying idea is to measure relevance by co-occurrence between terms: given the user's domain-interest keywords, every topic term set that contains such a keyword is regarded as relevant, and a topic containing no domain-interest keyword is regarded as having low or no relevance. The computation uses the relevance method of module 5.
(4) Relevance represents the importance of a topic; topics are sorted by importance and presented to the user.
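By way of non-limiting illustration, steps (3) and (4) can be sketched in Python as follows. Training the LDA model requires a topic-modelling library and is not shown; the sketch assumes the trained topics are already available as term-to-probability maps. Using those probabilities in place of the word-frequency vector, and defaulting a missing IDF value to 1.0, are assumptions of the sketch rather than details given in the description.

```python
def topic_importance(topic_terms, interest, idf=None):
    """Relevance of one topic: sum of interest weight * term probability * IDF."""
    idf = idf or {}
    return sum(
        interest.get(term, 0.0) * prob * idf.get(term, 1.0)
        for term, prob in topic_terms.items()
    )


def rank_topics(topics, interest, idf=None):
    """topics: {topic_id: {term: probability}}. Returns [(topic_id, importance)]."""
    scored = [
        (tid, topic_importance(terms, interest, idf))
        for tid, terms in topics.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A topic whose term set contains no domain-interest keyword scores 0 under this sketch and therefore sorts last, in line with step (3).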
The invention includes a prototype system, built on the Sina Weibo open platform, for verifying results. The sequence diagram of the system is shown in Fig. 3, which illustrates the overall usage flow of the invention:
1) After the user logs in, the system first verifies the user name and password and then calls Sina Weibo's OAuth verification module to authorize the microblog account; if the user has not bound a microblog account, one can be bound through OAuth.
2) After login and authorization complete, the system updates the user's data and persists the user's related microblog information locally to keep subsequent analysis timely.
3) After the data has been persisted locally, the segmenter and part-of-speech filter segment the user's domain-related microblogs in the database and filter them by part of speech, in preparation for keyword extraction and keyword expansion. Here the user's domain-related microblogs are mainly the microblogs posted by the bound enterprise microblog account, or those posted by domain-related users specified by the user; they may even be an external domain-related corpus. The invention does not restrict the segmenter or the part-of-speech tagger; in principle any Chinese segmenter and part-of-speech tagger can be used. The system corresponding to the invention uses the ICTCLAS5.0 segmentation and part-of-speech tagging system developed by the Institute of Computing Technology, Chinese Academy of Sciences, and, because that segmenter's dictionary is dated, also imports the external dictionary data of the Jieba segmenter. The preprocessing result is the user's historical microblog data.
4) The segmented and part-of-speech-filtered domain microblogs are passed, in the form of word frequencies, to the keyword extraction module, which extracts keywords with the TextRank for Weibo method; the extraction result represents the interest characteristics of the domain.
5) If the user has defined interest keywords, then, to further improve the recall of user-defined keywords, the system expands them on the basis of domain information using the P-IOW keyword expansion method. The expansion result is expressed as a word vector. If the user has defined no interest keywords, this step outputs an empty result.
6) After keyword extraction based on TextRank for Weibo and keyword expansion based on P-IOW have completed, their results are merged by linear weighting with user-defined weights. The user can set the proportion given to the keyword-extraction result and to the keyword-expansion result. The output is the merged word vector, which represents the user's final domain interest.
7) This step computes the similarity between the user's interest and each candidate microblog. The candidate microblog is first segmented and filtered by part of speech (as in step 3) and converted into a word-frequency vector. The dot product of the user's domain-interest vector, the converted microblog vector, and the word IDF values is then computed, as discussed in detail above. The product is the similarity between the candidate microblog and the user's domain interest.
8) With the text of the recommended domain microblogs as input, an LDA model clusters terms into topic term sets; finally the importance of each topic term set is computed and the topics are presented to the user by importance.
The architecture of the system implementing the invention is shown in Fig. 4. The system is built on a MySQL + JSP + Servlet + Twitter Bootstrap technology stack and is divided into six parts: the data acquisition and preprocessing module, the domain keyword extraction module, the user-defined keyword expansion module, the linear merging module, the similarity computation and personalized recommendation module, and the topic acquisition module. The target users of the invention are enterprise microblog users on the Sina Weibo platform. By analyzing the historical microblog text of domain-related microblog accounts that the user specifies, the system can automatically identify the characteristics of the user's domain. The system also lets the user define keywords dynamically to state their changing domain needs, and expands those keywords using their co-occurrence information in the historical microblogs, strengthening their semantic expressive power. The user can then model a fully personalized domain interest using keyword weights of their own choosing. Finally, the system uses the domain interest to filter and recommend the microblogs the user subscribes to and to recommend topics. This spares the user from checking large numbers of domain-irrelevant microblogs one by one and improves the working efficiency of enterprise marketing staff.

Claims (8)

1. A domain information recommendation system based on a microblog platform, characterized by comprising: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity computation and personalized recommendation module, and a topic acquisition module; wherein:
Data acquisition and preprocessing module: obtains the user's related microblog data and preprocesses it; preprocessing comprises stop-word filtering, word segmentation, and part-of-speech tagging of the data; the preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain-interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module;
Domain keyword extraction module: based on the preprocessing result, keywords are extracted without supervision by the TextRank for Weibo algorithm, a modification of the TextRank algorithm; the algorithm comprises two stages, construction of an undirected graph based on co-occurrence relations and graph-based computation of node weights; in the graph construction stage, each segmented word occurring in the user's historical microblogs is first converted into a corresponding node; when edges between nodes are constructed, whether an edge exists between two nodes and the weight of that edge are determined by the number of times the two words co-occur in the same microblog: if two words co-occur in one of the user's microblogs, the weight of the edge between the two corresponding nodes is increased by 1, and the final weight of the edge is the co-occurrence count of the two words in the microblogs; then, in the graph-based node weight computation stage, the weight of each node is computed iteratively until the change in node weights converges below a threshold; after iteration ends, the weight of each node is the importance of the word it represents, and all of the user's words are sorted by importance to obtain the keyword extraction result, thereby automatically identifying the characteristics of the user's domain;
User-defined keyword expansion module: computes the similarity between keywords from keyword co-occurrence, keyword distribution, and attribute information of the users the keywords belong to, and takes the words with high relevance as the expansion result of the target keyword; the module supports entry of multiple user-defined keywords, and for each user-defined keyword the expansion word vectors obtained are summed linearly to obtain the final expansion vector; user-defined keyword expansion allows the user's dynamic interest needs to be met in real time and considerably strengthens the expressive power of user-defined keywords;
Linear merging module: after automatic domain keyword extraction and the expansion based on user-defined keywords have both completed, the two result vectors are normalized by maximum normalization so that the result vectors of keyword extraction and keyword expansion are mapped into one unified value range; the two normalized vectors are then merged linearly, with user-defined weights for keyword extraction and keyword expansion; the module outputs one word vector representing the user's final domain interest;
Relevance computation and personalized recommendation module: after the linear merging module has produced the keyword vector describing the user's domain interest, word segmentation and word-frequency counting are performed on every microblog to be filtered to generate a word-frequency vector; the dot product of the user-interest keyword vector, the word-frequency vector of the candidate microblog, and the IDF vector is then taken to obtain the relevance between the microblog and the user's interest, this relevance being the domain relevance of the microblog; the domain relevance of each of the user's microblogs is computed, the microblogs are sorted from high to low domain relevance and presented to the user, realizing personalized domain microblog recommendation for the user;
Topic acquisition module: an LDA model is trained with the text of the domain microblogs recommended to the user as input, and terms are clustered into topics according to each topic's term distribution; the relevance between each topic's term set and the user's domain-interest keywords obtained from the linear merging module is computed to obtain topic importance, and topics are presented to the user in order of importance, completing topic discovery and recommendation.
2. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that the data acquisition and preprocessing module is implemented as follows:
(1) After the user logs in to the system, the user is first authenticated; once authentication succeeds, the microblog platform credential associated with the user is automatically used to interact with the microblog platform and verify the legitimacy of the user's identity on the microblog platform;
(2) The related microblog text that the user follows or subscribes to is obtained, and the data obtained is persisted in structured form in a local database so that it can be read at any time;
(3) The persisted microblog text is preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging; given the characteristics of microblog text, stop words are first filtered out by pattern matching; Chinese word segmentation and part-of-speech tagging optimized for the microblog scenario are then performed using the ICTCLAS5.0 segmenter; before keyword extraction and keyword expansion, the segmented microblog text is filtered by part of speech and only nouns are retained. The preprocessing result is the user's historical microblog data.
3. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the domain keyword extraction module, the weight of each node is computed iteratively from the historical microblog data following the idea of the PageRank algorithm, according to the formula:
$$TR(V_i) = (1-d) + d \times \sum_{V_j \in E(V_i)} \frac{w_{ij}}{\sum_{V_k \in E(V_j)} w_{jk}} \, TR(V_j)$$
where $V_i$ is the i-th node, $TR(V_i)$ is the weight of node $V_i$, and $w_{ij}$ is the weight of the edge between nodes $V_i$ and $V_j$; $E(V_i)$ is the set of nodes connected to $V_i$ by an edge; $d$ is the damping coefficient of the iteration, set to 0.85; iteration can start from arbitrary initial values and continues until convergence, the convergence condition being that the sum over all nodes of the absolute difference between a node's weight in the current iteration and in the previous iteration is less than a specified value.
4. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the user-defined keyword expansion module, the similarity between keywords is computed by the improved P-IOW algorithm, implemented as follows:
For a given user-defined keyword s, the domain relevance of a word t with respect to s is computed as follows:
$$P\text{-}IOW(t,s) = \begin{cases} \dfrac{\log(wf(s)+1)}{\log(WF_{t_M}+1)} \times P\text{-}IOLogW(t,s), & t \neq s \\ P\text{-}IOLogW(t,s), & t = s \end{cases} \qquad (1)$$
Wherein:
where s is the user-defined keyword; $wf(t)$ is the number of microblogs containing the word t; $t_M$ is the word contained in the largest number of microblogs in the domain-related corpus, and $WF_{t_M}$ is the number of microblogs containing $t_M$; the number of microblogs that do not contain the word s is also used; $wf(t \wedge s)$ is the number of microblogs containing both t and s; N is the total number of the user's microblogs; $s_p$ is a smoothing factor with range (0, 1) that prevents division by zero when a count used as a divisor is zero; it follows from the formula that, before the down-weighting factor in (1) is applied, the range of P-IOLogW lies between $\log(s_p)$ and $\log(1/s_p)$; the larger the value of P-IOW, the higher the domain similarity between the word t and the user-defined keyword s; the user-defined keyword itself is assigned the upper bound of the P-IOLogW range, $\log(1/s_p)$.
5. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the user-defined keyword expansion module, when the expansion word vectors obtained by keyword expansion are summed linearly to obtain the final expansion vector, the expansion procedure is as follows:
(1) For each keyword in the set of user-defined keywords, first compute the weights of all its expansion words using P-IOW;
(2) Map the expansion-word weight vector of the keyword onto the word space of the domain-related corpus to form an expansion word vector;
(3) Linearly superpose all expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.
6. The microblog-platform-based domain information recommendation system according to claim 1, characterized in that the linear merging module is implemented as follows:
(1) The vectors generated by keyword extraction and by keyword expansion are each normalized using maximum-value normalization;
(2) For each component of a vector, the normalization is:
$$v_{normal} = v / v_{max}$$
Wherein: v is the initial value of a given component of the vector and v_max is the maximum over all components of the vector. After maximum-value normalization, all components of the vector lie in (0,1] and are non-zero. The vectors are then merged by linear weighting:
$$V_{combine} = r \times V_{kw\text{-}extract} + (1-r) \times V_{kw\text{-}expand}$$
Wherein: r is the user-defined merge weight, V_{kw-extract} is the result vector of keyword extraction, V_{kw-expand} is the result vector generated by keyword expansion, and the merged result V_{combine} is the keyword vector characterizing the user's domain interests.
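The normalization and merge of claim 6 amount to a few lines of arithmetic. The sketch below assumes both result vectors are plain Python lists over the same vocabulary order; the function names are hypothetical.

```python
from typing import List


def max_normalize(v: List[float]) -> List[float]:
    """Divide every component by the largest component: v_normal = v / v_max."""
    v_max = max(v)
    if v_max <= 0:
        raise ValueError("maximum component must be positive")
    return [x / v_max for x in v]


def linear_merge(v_extract: List[float], v_expand: List[float],
                 r: float) -> List[float]:
    """V_combine = r * V_kw-extract + (1 - r) * V_kw-expand, after normalization."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("merge weight r must lie in [0, 1]")
    if len(v_extract) != len(v_expand):
        raise ValueError("both vectors must share the same dimensions")
    a = max_normalize(v_extract)
    b = max_normalize(v_expand)
    return [r * x + (1.0 - r) * y for x, y in zip(a, b)]
```

For example, `linear_merge([2, 4], [1, 0.5], 0.5)` normalizes the inputs to `[0.5, 1.0]` and `[1.0, 0.5]` and returns `[0.75, 0.75]`.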
7. The microblog-platform-based domain information recommendation system according to claim 1, characterized in that: in the similarity calculation and personalized recommendation module, the domain relevance is calculated by the following formula:
$$
\mathrm{Relevance}_T = V_{combine}\cdot W\cdot V_{IDF} = \sum_{t_i\in L} v_{t_i}\, w_{t_i}\, \mathrm{IDF}(t_i)
$$
Wherein: V_{combine} is the keyword vector of the user's domain interests; W is the word-frequency vector of the microblog T to be recommended; V_{IDF} is the IDF vector of the segmented words; L is the full set of dimensions of the vector space obtained by segmenting the user's historical microblogs; v_{t_i} and w_{t_i} are the components of V_{combine} and W corresponding to t_i; IDF(t_i) is the IDF value of t_i.
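The relevance score of claim 7 is a three-way element-wise product summed over the vocabulary. A minimal sketch, assuming the three vectors are stored as sparse dictionaries keyed by word (the representation and the function name are assumptions, not stated in the source):

```python
from typing import Dict


def relevance(v_combine: Dict[str, float],
              word_freq: Dict[str, int],
              idf: Dict[str, float]) -> float:
    """Sum over t_i in L of v_{t_i} * w_{t_i} * IDF(t_i).

    Words of the candidate microblog that lie outside the interest
    vocabulary L contribute nothing to the score.
    """
    return sum(
        v_combine[t] * w * idf.get(t, 0.0)
        for t, w in word_freq.items()
        if t in v_combine
    )
```

Candidate microblogs can then be ranked by this score in descending order.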
8. A microblog-platform-based domain information recommendation method, characterized in that it is divided into six steps: data acquisition and preprocessing, domain keyword extraction, user-defined keyword expansion, linear merging, similarity calculation and personalized recommendation, and topic acquisition, implemented as follows:
(1) Obtain the user's relevant microblog information and preprocess the data. Preprocessing includes filtering stop words by pattern matching, and word segmentation and part-of-speech tagging using the ICTCLAS5.0 segmentation system. The preprocessing result is the user's historical microblog data;
(2) Perform domain keyword extraction on the preprocessing result. Extraction is unsupervised and uses the TextRank for Weibo algorithm, which the present invention derives by modifying the TextRank algorithm. TextRank for Weibo has two stages: construction of an undirected graph based on co-occurrence relations, and graph-based node weight calculation. The graph construction stage first converts each word occurring in the user's historical microblogs into a corresponding node; then, within the segmentation result of each microblog, an edge is constructed for every pair of words that appear, with the edge weight equal to the pair's co-occurrence count in the historical microblogs;
The graph-based node weight calculation stage iteratively computes the weights following the idea of PageRank, until the change in node weights converges below a given threshold;
(3) If the user has defined domain interest keywords, the preprocessing result is additionally passed to the keyword expansion module. Keyword expansion uses the P-IOW algorithm, an improvement of the P-IOLog algorithm, which calculates the similarity between keywords from information such as keyword co-occurrence, keyword distribution, and the attributes of the users the keywords belong to, and takes highly related words as the expansion result of the target keyword. Multiple user-defined keywords may be entered; the keyword expansion module linearly sums the expansion word vectors produced for each user-defined keyword to obtain the final expansion vector;
(4) The two result vectors are normalized using maximum-value normalization and mapped into a unified value range. The two normalized vectors are then linearly merged; the merging process supports user-defined weights for keyword extraction and keyword expansion. The merged result is used by the relevance recommendation module for relevance comparison, so as to generate a relevance score for each microblog to be recommended;
(5) Segment the microblogs to be recommended from the user's subscriptions and vectorize them by word frequency; then, based on the identified user interests, take the dot product of the user-interest keyword vector, the word-frequency vector generated from the microblog to be recommended, and the IDF information vector, so that the domain relevance is computed by the vector-space dot product;
(6) Take the set of domain microblog texts recommended to the user as input, cluster terms with an LDA-based strategy to discover topics, then compute the relevance between each topic's term set and the user domain-interest keywords obtained in the linear merging module, establish topic importance, and make recommendations to the user according to topic importance.
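Step (2) of claim 8 describes the two stages of TextRank for Weibo only in outline. The sketch below illustrates them under stated assumptions: each microblog is already segmented into a list of words, a word pair counts once per microblog, and the damping factor 0.85 and the stopping threshold are conventional PageRank defaults rather than values given in the source. Function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List


def build_cooccurrence_graph(microblogs: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Stage 1: undirected graph whose edge weights are co-occurrence counts."""
    graph: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for words in microblogs:
        # Every pair of distinct words in one microblog gets an edge.
        for a, b in combinations(sorted(set(words)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def rank_nodes(graph: Dict[str, Dict[str, float]], damping: float = 0.85,
               tol: float = 1e-6, max_iter: int = 200) -> Dict[str, float]:
    """Stage 2: PageRank-style iteration until node weights stop changing."""
    scores = {node: 1.0 for node in graph}
    total_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        updated = {}
        for node in graph:
            # Each neighbour passes on a share of its score proportional
            # to the weight of the connecting edge.
            incoming = sum(
                graph[nbr][node] / total_weight[nbr] * scores[nbr]
                for nbr in graph[node]
            )
            updated[node] = (1.0 - damping) + damping * incoming
        delta = max((abs(updated[n] - scores[n]) for n in graph), default=0.0)
        scores = updated
        if delta < tol:
            break
    return scores
```

Words that never co-occur with another word do not become nodes in this sketch. The highest-scoring nodes would then serve as the extracted domain keywords.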
CN201611075431.XA 2016-11-28 2016-11-28 A kind of realm information commending system and method based on microblog Pending CN106776881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611075431.XA CN106776881A (en) 2016-11-28 2016-11-28 A kind of realm information commending system and method based on microblog

Publications (1)

Publication Number Publication Date
CN106776881A true CN106776881A (en) 2017-05-31

Family

ID=58900759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611075431.XA Pending CN106776881A (en) 2016-11-28 2016-11-28 A kind of realm information commending system and method based on microblog

Country Status (1)

Country Link
CN (1) CN106776881A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831234A (en) * 2012-08-31 2012-12-19 北京邮电大学 Personalized news recommendation device and method based on news content and theme feature
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
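Wu Yulong et al.: "An enterprise-oriented industry microblog information recommendation method" (一种面向企业的行业微博信息推荐方法), Computer Applications and Software (《计算机应用与软件》) *
Tang Xiaobo et al.: "Research on a microblog topic retrieval model combining text clustering and LDA" (基于文本聚类与LDA相融合的微博主题检索模型研究), Information Studies: Theory & Application (《情报理论与实践》) *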

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170531
