Microblog-based advertisement recommendation method and system
Technical field
The invention belongs to the field of data mining and, more particularly, relates to a microblog-based advertisement recommendation method and system.
Background art
With the popularity in China of social networking sites such as Sina Weibo and Tencent Weibo, microblogs and other social media have not only become platforms on which netizens publish, share and spread information, but have also accumulated behavioral data on a very large number of netizens. In May 2012, Lu Yi, deputy general manager of Sina's Weibo division, pointed out that Sina Weibo had more than 300 million registered users, who posted more than 100 million microblog posts per day on average. The microblog user base is large and so is the volume of data. If a microblog operating system can analyze and mine the existing mass data, judge the interests of microblog users more accurately from the analysis results, and deliver advertisements to users according to those interests, then the advertisements pushed to microblog users will benefit all three parties: microblog users, merchants and microblog operators.
Existing microblog advertisement recommendation methods mainly use the tags in a user's personal profile, or the user's search records, to judge the interests of a microblog user, and then push advertisements that the user may be interested in. Because many users' profiles contain no tags, or the tags filled in when the profile was created are inaccurate, recommending advertisements on the basis of user tags rarely achieves good results. Judging a user's interests from search records also has limitations: search records reflect only the user's current needs and cannot support an accurate judgment of the user's interests.
Summary of the invention
Embodiments of the present invention provide a microblog-based advertisement recommendation method, intended to solve the problem that existing methods mine user information with low accuracy, which leads to poor advertisement recommendation results.
The embodiments of the present invention are realized as a microblog-based advertisement recommendation method comprising the following steps:
Reading the microblog data of a user;
Initializing the read microblog data to obtain a microblog text term set, where the initialization includes removing special symbols and non-Chinese characters from the read microblog data and performing word segmentation;
Deleting the stop words from the microblog text term set to obtain an original feature term set of the microblog text;
Mapping the original feature term set onto a pre-generated feature term dictionary: judging whether each term in the original feature term set appears in the pre-generated feature term dictionary and, for each term that does, calculating its term frequency-inverse document frequency (tf-idf) value as the feature value of that term in the microblog post;
Judging whether each term of the pre-generated feature term dictionary appears in the original feature term set, and setting to 0 the feature value of every dictionary term that does not appear in the original feature term set;
Using a previously obtained classification model to automatically classify the user's microblog data into pre-defined categories;
Recommending advertisements to the user whose microblog data was read, on the basis of the automatic classification result.
Another object of the embodiments of the present invention is to provide a microblog-based advertisement recommendation system, the system comprising:
A first data reading module, for reading the microblog data of a user;
A first data initialization module, for initializing the read microblog data to obtain a microblog text term set, where the initialization includes removing special symbols and non-Chinese characters from the read microblog data and performing word segmentation;
A first feature extraction module, for deleting the stop words from the microblog text term set to obtain an original feature term set of the microblog text;
A first feature vectorization module, for mapping the original feature term set onto a pre-generated feature term dictionary: judging whether each term in the original feature term set appears in the pre-generated feature term dictionary and, for each term that does, calculating its tf-idf value as the feature value of that term in the microblog post; and for judging whether each term of the pre-generated feature term dictionary appears in the original feature term set, and setting to 0 the feature value of every dictionary term that does not appear in the original feature term set;
A classification module, for using a previously obtained classification model to automatically classify the user's microblog data into pre-defined categories;
A recommendation module, for recommending advertisements to the user whose microblog data was read, on the basis of the automatic classification result.
In the embodiments of the present invention, the microblog data posted by a user is more timely than the information contained in user tags and better represents the user's interest preferences. The judgment obtained by analyzing the user's microblog data is therefore more accurate, so the recommended advertisements are better targeted and more effective.
Brief description of the drawings
Fig. 1 is a flow chart of a microblog-based advertisement recommendation method provided by the first embodiment of the present invention;
Fig. 2 is a structural diagram of a microblog-based advertisement recommendation system provided by the second embodiment of the present invention;
Fig. 3 is a structural diagram of another microblog-based advertisement recommendation system provided by the second embodiment of the present invention.
Detailed description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The embodiments of the present invention mine and classify the microblog data posted by a user to judge the user's interest preferences, and then recommend corresponding advertisements to the user.
The technical solutions of the present invention are illustrated below by means of specific embodiments.
Embodiment one:
Fig. 1 shows a microblog-based advertisement recommendation method provided by the first embodiment of the present invention, detailed as follows:
Step S11: read the microblog data of a user.
In this step, users' microblog data may be obtained in advance and stored in a database. When the microblog data of a particular user needs to be analyzed, that user's microblog data is read from the database.
Step S12: initialize the read microblog data to obtain a microblog text term set, where the initialization includes removing special symbols and non-Chinese characters from the read microblog data and performing word segmentation.
In this step, each microblog post is initialized, for example by removing special symbols such as punctuation marks, removing non-Chinese characters, and segmenting the text into words. The initialization yields one microblog text term set per post.
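As an illustration only, the initialization of step S12 could be sketched in Python as follows. The text does not name a segmentation algorithm, so the forward maximum matching used here, the toy vocabulary VOCAB, and the function names keep_chinese, segment and initialize are all assumptions made for the example; a real system would rely on a full segmentation lexicon or a dedicated segmenter.

```python
import re

# Hypothetical toy vocabulary, for illustration only.
VOCAB = {"今天", "天气", "很好", "我们", "足球", "比赛"}
MAX_WORD_LEN = 4  # assumed upper bound on word length


def keep_chinese(text):
    """Remove punctuation, Latin letters, digits, URLs and any other
    character outside the CJK Unified Ideographs range U+4E00..U+9FFF."""
    return re.sub(r"[^\u4e00-\u9fff]", "", text)


def segment(text, vocab=VOCAB, max_len=MAX_WORD_LEN):
    """Forward maximum matching: at each position take the longest
    vocabulary word; unknown characters become one-character terms."""
    terms = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                terms.append(piece)
                i += size
                break
    return terms


def initialize(post):
    """Step S12 sketch: clean one microblog post, then segment it."""
    return segment(keep_chinese(post))
```

Calling initialize on a raw post returns the list of terms that the later steps filter and vectorize. A list rather than a set is returned so that term counts remain available for the tf-idf calculation.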
Step S13: delete the stop words from the microblog text term set to obtain the original feature term set of the microblog text.
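A minimal sketch of the stop-word removal in step S13, assuming the term set is held as a Python list. The stop-word list STOP_WORDS and the function name remove_stop_words are hypothetical; the text does not specify which stop-word list is used.

```python
# Hypothetical, very small stop-word list for illustration only.
STOP_WORDS = {"的", "了", "是", "我", "在", "也", "很"}


def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Return the terms that are not stop words, preserving order
    and duplicates so that term frequencies can still be counted."""
    return [term for term in terms if term not in stop_words]
```

For example, remove_stop_words(["我", "喜欢", "看", "足球", "的", "比赛"]) keeps only the four content-bearing terms.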
Step S14: map the original feature term set onto the pre-generated feature term dictionary: judge whether each term in the original feature term set appears in the pre-generated feature term dictionary and, for each term that does, calculate its term frequency-inverse document frequency (tf-idf) value as the feature value of that term in the microblog post.
In this step, the original feature term set of each microblog post is mapped onto the feature term dictionary. If a term of the original feature term set is in the feature term dictionary, the tf-idf value of the term is calculated and used as the feature value of the term in that post.
Step S15: judge whether each term of the pre-generated feature term dictionary appears in the original feature term set, and set to 0 the feature value of every dictionary term that does not appear in the original feature term set.
In this step, a term of the original feature term set that is not in the feature term dictionary is ignored, and a dictionary term that does not appear in the original feature term set receives a feature value of 0. In the end, the text of each microblog post is converted into a feature vector of dimension 5000.
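Steps S14 and S15 together turn one post into a fixed-length vector indexed by the dictionary. The sketch below is one possible reading: the text does not give its tf-idf formula, so the relative term frequency and the smoothed idf, log((1 + n) / (1 + df)) + 1, are assumptions, as are the names document_frequencies and vectorize and the use of document frequencies collected from a training corpus.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Count, for every term, the number of posts that contain it.
    `corpus` is a list of term lists (one list per post)."""
    df = Counter()
    for terms in corpus:
        df.update(set(terms))
    return df


def vectorize(terms, dictionary, df, n_docs):
    """Map one post onto the feature term dictionary.

    Dictionary terms present in the post get a tf-idf feature value
    (step S14); dictionary terms absent from the post get 0 (step S15);
    post terms outside the dictionary are ignored.
    """
    counts = Counter(terms)
    total = len(terms)
    vector = []
    for word in dictionary:
        if total == 0 or counts[word] == 0:
            vector.append(0.0)
            continue
        tf = counts[word] / total
        # Assumed smoothed idf; always positive, never divides by zero.
        idf = math.log((1 + n_docs) / (1 + df[word])) + 1
        vector.append(tf * idf)
    return vector
```

The vector length always equals the dictionary length, so a dictionary of 5000 terms yields the 5000-dimensional vector described above.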
Step S16: using a previously obtained classification model, automatically classify the user's microblog data into pre-defined categories.
In this step, multiple categories may be defined in advance according to actual needs. For example, 12 categories may be defined: sports; health; education; travel; technology; automotive; games; beauty, hairdressing and body care; food; clothing, shoes and bags; entertainment; and other.
The sports category covers sporting events, sports publications, sports stars and the like;
The health category covers health knowledge, medicine, physical condition and the like;
The education category covers training institutions such as New Oriental and New Channel, personal study status, study plans, studying abroad and the like;
The travel category covers scenic spots, amusement parks, overseas travel, independent travel, hotels and the like;
The technology category covers mobile phones, computers, digital products and the like;
The automotive category covers cars, car magazines and the like;
The games category covers mobile games, browser games, online games and the like;
The beauty, hairdressing and body care category covers skin care products, cosmetics, manicures, slimming, toiletries and the like;
The food category covers food, foodies, recipes and the like;
The entertainment category covers show business, concerts, stage plays, exhibitions and the like;
The other category covers personal mood, personal feelings, views on society, views on life and the like.
Step S17: recommend advertisements to the user whose microblog data was read, on the basis of the automatic classification result.
In this step, if the automatic classification assigns the user's microblog data to a certain category, advertisements corresponding to that category are recommended to the user. Here, advertisements include news, music, films, microblog posts and the like.
In the embodiments of the present invention, the microblog data posted by a user is mined and classified to judge the user's interest preferences, and corresponding advertisements are then recommended to the user. Because the microblog data posted by a user is more timely than the information contained in user tags and better represents the user's interest preferences, the judgment obtained by analyzing the user's microblog data is more accurate, so the recommended advertisements are better targeted and more effective.
As an embodiment of the present invention, the following steps are performed before the step in S16 of automatically classifying the user's microblog data into pre-defined categories using the previously obtained classification model:
Step A: read training microblog data.
In this step, the microblog data of multiple users is read as training data, so as to make the subsequent mining as accurate as possible.
Step B: manually label the read training microblog data with the pre-defined categories.
In this step, several annotators label each read microblog post with one of the pre-defined categories. When the annotators disagree on the category of a post, the majority decision is adopted.
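The majority rule of step B can be sketched as below. The function name majority_label is hypothetical, and returning None on a tie is an assumption: the text does not say how ties between annotators are resolved.

```python
from collections import Counter


def majority_label(votes):
    """Return the category chosen by most annotators for one post.

    `votes` is a non-empty list of category names, one per annotator.
    Returns None when the top two categories are tied (assumed policy;
    such posts could be re-labelled or discarded).
    """
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

With three annotators voting sports, sports and food, the post is labelled sports.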
Step C: initialize the read training microblog data to obtain a microblog text term set, where the initialization includes removing special symbols and non-Chinese characters from the read training microblog data and performing word segmentation.
Step D: delete the stop words from the microblog text term set to obtain the original feature term set of the microblog text.
Step E: generate the feature term dictionary.
In this step, generating the feature term dictionary specifically includes: calculating the mutual information value of each term in the original feature term set; and selecting the N terms with the highest mutual information values as the terms of the feature term dictionary, where N is an integer greater than 0. For example, the 5000 terms with the highest mutual information values are selected as the terms of the feature term dictionary. The generated feature term dictionary may be ordered by mutual information value.
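The text does not state which mutual information formula step E uses. The sketch below assumes the pointwise form common in text classification, MI(t, c) = log(P(t, c) / (P(t) P(c))), estimated from document counts, with a term's score taken as its maximum over the categories in which it occurs. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def mutual_information_scores(docs, labels):
    """Score every term by its maximum pointwise mutual information
    with any category. `docs` is a list of term lists and `labels`
    the matching list of category names."""
    n = len(docs)
    class_count = Counter(labels)   # posts per category
    term_count = Counter()          # posts containing the term
    joint = defaultdict(Counter)    # joint[term][category] post counts
    for terms, label in zip(docs, labels):
        for term in set(terms):
            term_count[term] += 1
            joint[term][label] += 1
    scores = {}
    for term, per_class in joint.items():
        scores[term] = max(
            math.log(a * n / (term_count[term] * class_count[c]))
            for c, a in per_class.items()
        )
    return scores


def build_dictionary(docs, labels, n_terms):
    """Return the n_terms highest-scoring terms, ordered by score
    (ties broken alphabetically so the result is deterministic)."""
    scores = mutual_information_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:n_terms]
```

A term spread evenly across categories scores 0 and falls to the bottom of the ranking, while a term concentrated in one category scores higher. This pointwise form tends to favor rare terms, so a production system might add a minimum document-frequency threshold.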
Step F: map the original feature term set onto the feature term dictionary: judge whether each term in the original feature term set appears in the feature term dictionary and, for each term that does, calculate its tf-idf value as the feature value of that term in the microblog post.
Step G: judge whether each term of the feature term dictionary appears in the original feature term set, and set to 0 the feature value of every dictionary term that does not appear in the original feature term set.
Step H: train a preset algorithm on the feature vectors formed from all the calculated feature values, to obtain the classification model.
In this step, the feature vector matrix corresponding to all the microblog data is used for training, and the trained result can be used directly when the microblog data of a particular user is subsequently mined.
The preset algorithm is any one of the following: support vector machine (SVM), naive Bayes classification algorithm, neural network, k-nearest-neighbor classification algorithm, genetic algorithm.
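Of the algorithms listed, k-nearest-neighbor is the simplest to show in a few lines. The sketch below is illustrative only: the class name KnnClassifier, the choice of cosine similarity and the default k = 3 are assumptions, and any of the other listed algorithms could be substituted.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


class KnnClassifier:
    """k-nearest-neighbor classifier over tf-idf feature vectors."""

    def __init__(self, k=3):
        self.k = k
        self.samples = []

    def fit(self, vectors, labels):
        # "Training" for k-NN only stores the labelled vectors.
        self.samples = list(zip(vectors, labels))
        return self

    def predict(self, vector):
        # Rank stored samples by similarity and vote among the top k.
        nearest = sorted(
            self.samples,
            key=lambda sample: cosine(vector, sample[0]),
            reverse=True,
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
```

After fit has been called on the labelled training vectors from steps F and G, predict assigns a category to the vector of a new post, which corresponds to the use of the model in step S16.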
In this embodiment, a feature term dictionary is generated by analyzing the microblog data of a large number of users. The feature term dictionary provides a reference standard for later mining the interest preferences of a particular user.
As an embodiment of the present invention, step S17, recommending advertisements to the user whose microblog data was read on the basis of the automatic classification result, specifically includes: counting the percentage of the user's microblog posts that falls into each category; matching each category against the tags in the user's microblog profile, and doubling the percentage of every category that matches; and recommending to the user advertisements from the M top-ranked categories, where M is an integer greater than 0.
In this embodiment, the user's historical microblog posts are classified, the percentage of posts in each category is counted, and the categories are matched against the tags in the user's profile. If the tags contain a category, the percentage for that category is doubled. Finally, the M categories with the highest percentages are selected; for example, three categories are selected as the user's advertisement recommendation categories. Preferably, the calculation can be repeated after a period of time to obtain the user's latest advertisement recommendation categories.
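The percentage, doubling and top-M selection can be sketched as follows. The function name recommend_categories is hypothetical, and treating a "match" as the category name appearing verbatim among the profile tags is an assumption; the text does not define the matching rule more precisely.

```python
from collections import Counter


def recommend_categories(post_categories, profile_tags, m=3):
    """Pick the M categories to advertise to one user.

    `post_categories` holds the predicted category of each of the
    user's posts; `profile_tags` is the set of tags in the profile.
    """
    total = len(post_categories)
    scores = {}
    for category, count in Counter(post_categories).items():
        share = count / total
        if category in profile_tags:
            share *= 2  # double the share of a category matched by a tag
        scores[category] = share
    # Highest share first; alphabetical order breaks ties deterministically.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:m]
```

For a user with five sports posts, two food posts, two travel posts and one technology post, and a profile tag "travel", the travel share rises from 20% to 40%, so the top two categories are sports and travel.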
Embodiment two:
Fig. 2 shows the structure of a microblog-based advertisement recommendation system provided by the second embodiment of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown.
The microblog-based advertisement recommendation system may be used in various information processing terminals connected to a server via a wired or wireless network, such as mobile phones, Pocket PCs (Pocket Personal Computer, PPC), palmtop computers, computers, notebook computers and personal digital assistants (Personal Digital Assistant, PDA). It may be a software unit, a hardware unit, or a unit combining software and hardware that runs in these information processing terminals, or it may be integrated into these terminals as an independent plug-in or run in their application systems, wherein:
A first data reading module 201, for reading the microblog data of a user.
A first data initialization module 202, for initializing the read microblog data to obtain a microblog text term set, where the initialization includes removing special symbols and non-Chinese characters from the read microblog data and performing word segmentation.
A first feature extraction module 203, for deleting the stop words from the microblog text term set to obtain the original feature term set of the microblog text.
A first feature vectorization module 204, for mapping the original feature term set onto the pre-generated feature term dictionary: judging whether each term in the original feature term set appears in the pre-generated feature term dictionary and, for each term that does, calculating its tf-idf value as the feature value of that term in the microblog post; and for judging whether each term of the pre-generated feature term dictionary appears in the original feature term set, and setting to 0 the feature value of every dictionary term that does not appear in the original feature term set.
After the calculation by the first feature vectorization module 204, the data of each microblog post is finally converted into a feature vector of dimension 5000.
A classification module 205, for using a previously obtained classification model to automatically classify the user's microblog data into pre-defined categories.
The pre-defined categories may be the 12 categories described in step S16, which are not repeated here.
A recommendation module 206, for recommending advertisements to the user whose microblog data was read, on the basis of the automatic classification result.
Here, advertisements include news, music, films, microblog posts and the like.
In the embodiments of the present invention, the read microblog data is mined and assigned to categories, and advertisements related to the assigned categories are recommended to the user. Because microblog data reflects a user's interest preferences in a timely way, the judgment obtained by analyzing the user's microblog data is more accurate, so the recommended advertisements are better targeted and more effective.
Fig. 3 shows another structure of the microblog-based advertisement recommendation system. As another preferred embodiment of the present invention, the microblog-based advertisement recommendation system further includes:
A second data reading module 301, for reading training microblog data.
The microblog data read is the microblog data of multiple users.
A manual classification module 302, for manually labeling the read training microblog data with the pre-defined categories.
A second data initialization module 303, for initializing the read training microblog data to obtain a microblog text term set, where the initialization includes removing special symbols and non-Chinese characters from the read training microblog data and performing word segmentation.
A second feature extraction module 304, for deleting the stop words from the microblog text term set to obtain the original feature term set of the microblog text.
A feature term dictionary generation module 305, for generating the feature term dictionary.
The feature term dictionary generation module 305 includes:
A mutual information calculation module, for calculating the mutual information value of each term in the original feature term set.
A dictionary term selection module, for selecting the N terms with the highest mutual information values as the terms of the feature term dictionary, where N is an integer greater than 0.
A second feature vectorization module 306, for mapping the original feature term set onto the feature term dictionary: judging whether each term in the original feature term set appears in the feature term dictionary and, for each term that does, calculating its tf-idf value as the feature value of that term in the microblog post; and for judging whether each term of the feature term dictionary appears in the original feature term set, and setting to 0 the feature value of every dictionary term that does not appear in the original feature term set.
A training module 307, for training a preset algorithm on the feature vectors formed from all the calculated feature values, to obtain the classification model.
The preset algorithm is any one of the following:
Support vector machine (SVM), naive Bayes classification algorithm, neural network, k-nearest-neighbor classification algorithm, genetic algorithm.
In this embodiment, a feature term dictionary is generated by analyzing the microblog data of a large number of users. The feature term dictionary provides a reference standard for later mining the interest preferences of a particular user.
As an embodiment of the present invention, the recommendation module 206 includes:
A data statistics module, for counting the percentage of the user's microblog posts that falls into each category.
A data matching module, for matching each category against the tags in the user's microblog profile, and doubling the percentage of every category that matches.
An advertisement recommendation module, for recommending to the user whose microblog data was read advertisements from the M top-ranked categories, where M is an integer greater than 0.
In this embodiment, only advertisements from the M top-ranked categories are recommended to the client, which makes advertisement delivery more precise without increasing the user's browsing burden.
In the embodiments of the present invention, the microblog data posted by a user is mined and classified, and is combined with the user's microblog tag information to judge the user's interest preferences, after which corresponding advertisements are recommended to the user. Because the microblog data posted by a user is more timely than the information contained in user tags and better represents the user's interest preferences, the judgment obtained by analyzing both the user's microblog data and the tag information is more accurate than one based on tag information alone, so the recommended advertisements are better targeted and more effective.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.