CN106446191A - Logistic regression based multi-feature network popular tag prediction method - Google Patents
Logistic regression based multi-feature network popular tag prediction method
- Publication number: CN106446191A
- Application number: CN201610864860.9A
- Authority: CN (China)
- Prior art keywords: label, tag, network, populartag, unpopulartag
- Legal status: Granted (the listed status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9562—Bookmark management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a Logistic regression based multi-feature network popular tag prediction method. The method includes the following steps: (1) constructing a weighted undirected tag network from the post data of a question-and-answer website; (2) extracting a popular tag set and an unpopular tag set according to tag occurrence frequency; (3) extracting the network features of the tags, the attribute features of the tag proposers, and the attribute change features of the tags after they are proposed as feature vectors; (4) training with Logistic multiple regression and constructing a tag classification model. The relevance between tags is taken into consideration and the tags are classified according to multiple features, so potentially popular tags are predicted with high precision. The method helps guide users to select reasonable tags and helps website designers provide higher-quality tags.
Description
Technical field
The present invention relates to the fields of data mining and computer technology, and more particularly to a multi-feature network popular tag prediction method based on Logistic regression.
Background
A web tag (Tag) is a way of organizing Internet content. It generally consists of a few keywords closely related to the content; it helps people describe and categorize content easily and also makes information easier to retrieve and share. Because of this convenience, tag prediction and tag recommendation have been widely applied in recent years on many network platforms, such as the question-and-answer website StackExchange, the photo-sharing website Flickr, and the restaurant review website Yelp. Using suitable tags is important for both the website and its users. For the website, suitable tags support personalized recommendation and increase user stickiness and click-through rate; for users, tags help them quickly locate what they need and avoid wasting time browsing useless information. When choosing tags, selecting potentially popular tags is a crucial step, because popular tags often represent the needs of most users.
At present, tags are selected for a piece of information mainly on the basis of the textual relevance between the information and the tag and the attributes of the information's publisher. This kind of selection has several drawbacks: 1. it ignores the potential popularity trend of a tag; 2. it ignores the correlation between tags; 3. unpopular content leads to unpopular tags, so the information cannot be searched effectively; 4. only a few features are considered, so tag selection tends to be one-sided.
Therefore, to help users choose better tags when publishing content, potentially popular tags should be chosen as far as possible. The multi-feature network popular tag prediction method based on Logistic regression of the present invention addresses two basic problems: (1) predicting the future popularity trend of a tag; (2) applying a large number of features to characterize the popularity trend of a tag quantitatively.
Summary of the invention
To overcome the shortcomings of existing tag selection systems, which ignore the potential popularity trend of tags and the correlation between tags and which evaluate only a few features, the invention provides a multi-feature network popular tag prediction method based on Logistic regression. It considers multiple features and the correlation between tags, and it can also better predict the popularity trend of tags.
The technical solution adopted by the present invention to solve this technical problem is as follows:
A multi-feature network popular tag prediction method based on Logistic regression, comprising the following steps:
S1: Data preprocessing: collect the website's information content and tag data and sort the content in ascending order of time. The first α% of posts are regarded as transient data from before the tag network stabilized, and this transient data is deleted. From the remaining data, the leading portion of a preset ratio is chosen as training data;
S2: Build the tag network: for the tags that appear in the same piece of content, form an edge between each pair. Traverse all the information to obtain the weighted undirected tag network graph G_Tag, in which the weight of an edge is the number of times the two tags appear together;
S3: Sort the tags in descending order of the frequency with which they appear in posts, and take the top β% of tags as the popular tag set U_PopularTag;
S4: Find the unpopular tag set U_UnPopularTag: for each popular tag t ∈ U_PopularTag, find the time at which t first appears and, centered on this time, search for the tag whose first appearance is closest to it and which does not belong to U_PopularTag. This tag is taken as an unpopular tag, and these tags form the comparison set of unpopular tags U_UnPopularTag;
S5: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the network features of each tag: on the weighted undirected network G_Tag, extract the node degree and the node degree centrality of the neighbor nodes to which the sample tag is connected when it first appears;
S6: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the proposer attribute features of each tag, specifically the quantity of information content the tag's proposer had published at the time of proposing the tag and the length of the information content;
S7: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the attribute change features of each tag, specifically the number of answers received within 5 days after the tag is proposed by the posts carrying the tag;
S8: Using Logistic multiple regression with the features of the tags in the set U = {U_PopularTag, U_UnPopularTag} as training data, train and build the tag classifier model.
Further, in step S1, α% is determined as follows: the point at which a preset percentage of the website's total number of distinct tags has appeared is taken as the cut-off point for α%. The purpose is to ensure that the tag network is not affected by the tag debugging that staff performed when the website was first set up.
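The cut-off rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stabilization_cutoff`, the representation of posts as lists of tag strings, and the default fraction of 0.2 (the 20% used in the embodiment) are choices made here, not details confirmed by the patent.

```python
from typing import Sequence, Set


def stabilization_cutoff(posts: Sequence[Sequence[str]], tag_fraction: float = 0.2) -> int:
    """Return how many leading posts are treated as transient data.

    posts: the tag list of each post, already sorted by posting time (ascending).
    tag_fraction: assumed preset fraction of all distinct tags that must have
        appeared before the tag network is considered stable.
    """
    all_tags: Set[str] = {tag for tags in posts for tag in tags}
    target = len(all_tags) * tag_fraction
    seen: Set[str] = set()
    for index, tags in enumerate(posts, start=1):
        seen.update(tags)
        if len(seen) >= target:
            return index  # number of posts traversed when the threshold is reached
    return len(posts)


# Hypothetical toy data: five posts, five distinct tags.
posts = [["a"], ["a", "b"], ["c"], ["d", "e"], ["a"]]
n_unstable = stabilization_cutoff(posts, tag_fraction=0.4)
alpha = n_unstable / len(posts)  # fraction of posts discarded as transient
```

With the toy data, 40% of the five distinct tags (two tags) have appeared by the second post, so two posts are discarded and alpha is 0.4.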
Further, in step S5, the node degree of neighbor i is calculated using formula (1),
where g denotes the total number of nodes in the network; x_ij = 1 if there is an edge between nodes i and j, and x_ij = 0 otherwise;
The node degree centrality of neighbor i is calculated using formula (2).
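Formulas (1) and (2) are not reproduced in this text. A hedged reconstruction from the definitions of g and x_ij, assuming the standard node degree and the standard normalized degree centrality (the symbols k_i and C_D(i) are illustrative and not taken from the source):

```latex
k_i = \sum_{j=1}^{g} x_{ij} \qquad (1)

C_D(i) = \frac{k_i}{g - 1} \qquad (2)
```

Under this reading, a neighbor connected to 3 of the other nodes in a 7-node network has k_i = 3 and C_D(i) = 3/6 = 0.5.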
The beneficial effects of the present invention are as follows: the correlation between tags is taken into account and tags are classified according to multiple features, so potentially popular tags are predicted with higher precision. This both helps guide users to select reasonable tags and helps website builders provide higher-quality tags.
Brief description of the drawings
Fig. 1 is a flow chart of the multi-feature tag classification method based on Logistic regression according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of tag occurrence frequency according to an embodiment of the present invention.
Detailed description
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1 and Fig. 2, in a multi-feature network popular tag prediction method based on Logistic regression, the present invention uses the data officially published by Tex.Stackexchange.com, a sub-site of the question-and-answer website StackExchange, to model and analyze a tag classification system. The raw data records, for each post, the posting time, the poster's ID, the post's tags, and other information. Taking a tag studied in this patent as an example, we extract the time at which the tag first appears, the ID of the tag's proposer, the IDs of its neighbor tags, and other information.
In this embodiment, the specific steps of the multi-feature tag classification method based on Logistic regression are:
1) Build the tag network: process the published post data as follows:
1.1) Traverse the post data to obtain the set of all tags T_I, I ∈ N, where N denotes the total number of tags. Take N × 20% as the number of tags required for the website's tags to stabilize; the benefit is to avoid the noise that staff introduced into the posts while debugging the website content when the website was first established;
1.2) Sort the posts in ascending chronological order and traverse the post data again. When the number of distinct tags seen reaches N × 20%, record the number of posts traversed so far as N_InstablePosts; the publication time of the post reached at this point is regarded as the website's tag stabilization time;
1.3) Determine $\alpha\% = N_{InstablePosts} / N_{Posts}$, where N_Posts is the total number of published posts;
1.4) Build the tag network: remove the first α% of posts and read the first 80% of the posts in the question-and-answer website data as training data. The tag network is built as follows: for the tags that appear in the same post, form an edge between each pair. Traverse all the information to obtain the weighted undirected tag network graph G_Tag, in which the weight of an edge is the number of times the two tags appear together;
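Step 1.4 can be sketched as follows. This is a minimal illustration, assuming each post is given as a list of tag strings in chronological order; the function names `select_training_posts` and `build_tag_network` and the dictionary-of-edges representation are hypothetical choices, not the patent's own data structures.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Sequence


def select_training_posts(posts: Sequence, n_unstable: int, train_fraction: float = 0.8) -> List:
    """Drop the transient leading posts, then keep the first 80% of the remainder."""
    remaining = list(posts[n_unstable:])
    return remaining[: int(len(remaining) * train_fraction)]


def build_tag_network(posts: Iterable[Sequence[str]]) -> Dict[FrozenSet[str], int]:
    """Build a weighted undirected co-occurrence network.

    Each key is an unordered pair of tags (an edge); each value is the number
    of posts in which the two tags appear together (the edge weight).
    """
    weights: Counter = Counter()
    for tags in posts:
        # sorted(set(...)) removes repeated tags inside one post
        for a, b in combinations(sorted(set(tags)), 2):
            weights[frozenset((a, b))] += 1
    return dict(weights)


# Hypothetical toy posts.
posts = [["latex", "tables"], ["latex", "tables", "fonts"], ["fonts"]]
g_tag = build_tag_network(posts)
```

Using `frozenset` keys makes the edge ("latex", "tables") identical to ("tables", "latex"), which matches the undirected network described in the text.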
2) Obtain the popular tag set U_PopularTag: process the published post data as follows:
2.1) Traverse the post data to obtain the frequency with which each tag appears in posts;
2.2) Sort the tags in descending order of frequency and take the top β% of tags as the popular tag set U_PopularTag. Here, we select β% = 5%;
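Step 2 amounts to a frequency count followed by a top-β% cut. The sketch below assumes posts are lists of tag strings; the function name `popular_tags`, the rounding rule (round up, keep at least one tag), and the alphabetical tie-break are assumptions added here, since the source does not specify them.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Set


def popular_tags(posts: Iterable[Sequence[str]], beta: float = 0.05) -> Set[str]:
    """Return the top `beta` fraction of tags by the number of posts they appear in."""
    freq = Counter(tag for tags in posts for tag in set(tags))
    # Descending frequency; ties broken alphabetically (an assumption).
    ranked = sorted(freq, key=lambda tag: (-freq[tag], tag))
    # Round up and keep at least one tag (an assumption).
    k = max(1, math.ceil(len(ranked) * beta)) if ranked else 0
    return set(ranked[:k])


# Hypothetical toy posts: "a" appears 3 times, "b" twice, "c" and "d" once.
posts = [["a", "b"], ["a", "c"], ["a", "b"], ["d"]]
top_half = popular_tags(posts, beta=0.5)
top_default = popular_tags(posts)
```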
3) Obtain the unpopular tag set U_UnPopularTag, specifically:
3.1) For each tag, traverse the posts to obtain the time at which the tag first appears;
3.2) For each popular tag t ∈ U_PopularTag, search all remaining tags (tags that are not in the popular tag set) and compute the time difference ΔT between the first appearance of each remaining tag and the first appearance of t;
3.3) Sort the time differences ΔT in ascending order and take the tag t' with the smallest ΔT as an unpopular tag, thereby forming the unpopular tag set U_UnPopularTag;
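The pairing in step 3 can be sketched as a nearest-neighbor search on first-appearance times. The sketch assumes numeric timestamps and posts given as `(timestamp, tags)` pairs; the function names and the alphabetical tie-break are hypothetical. The source does not say whether two popular tags may share the same unpopular counterpart, so this sketch simply lets the result set absorb duplicates.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple


def first_appearance(posts: Iterable[Tuple[float, Sequence[str]]]) -> Dict[str, float]:
    """Map each tag to the earliest timestamp at which it appears."""
    first: Dict[str, float] = {}
    for timestamp, tags in posts:
        for tag in tags:
            if tag not in first or timestamp < first[tag]:
                first[tag] = timestamp
    return first


def unpopular_tags(first: Dict[str, float], popular: Set[str]) -> Set[str]:
    """For each popular tag, pick the non-popular tag with the closest first appearance."""
    candidates = {tag: ts for tag, ts in first.items() if tag not in popular}
    chosen: Set[str] = set()
    for p in sorted(popular):
        if not candidates:
            break
        # Smallest |delta T|; ties broken alphabetically (an assumption).
        nearest = min(candidates, key=lambda tag: (abs(candidates[tag] - first[p]), tag))
        chosen.add(nearest)
    return chosen


# Hypothetical toy data with numeric timestamps.
posts = [(1, ["a"]), (2, ["b"]), (5, ["c"]), (6, ["d"])]
first = first_appearance(posts)
unpopular = unpopular_tags(first, popular={"a", "d"})
```

Matching each popular tag with a tag that first appeared at about the same time gives both classes a comparable amount of time to accumulate usage.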
4) Extract the network features of the tags, specifically:
4.1) For each tag t ∈ {U_PopularTag, U_UnPopularTag}, calculate the node degree of neighbor i using formula (1),
where g denotes the total number of nodes in the network; x_ij = 1 if there is an edge between nodes i and j, and x_ij = 0 otherwise;
4.2) Calculate the node degree centrality of neighbor i using formula (2);
4.3) Normalize the neighbor node degree and the neighbor node degree centrality; the normalization denominator is the number of neighbor nodes;
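Step 4 can be sketched as averaging two quantities over the neighbors of a sample tag. The adjacency-dictionary representation and the name `neighbor_features` are hypothetical. Because formula (2) is not reproduced in this text, the sketch assumes the standard normalized degree centrality, degree divided by g − 1; that choice is an assumption rather than the patent's confirmed formula.

```python
from typing import Dict, Iterable, Set, Tuple


def neighbor_features(adjacency: Dict[str, Set[str]], neighbors: Iterable[str]) -> Tuple[float, float]:
    """Return (mean neighbor degree, mean neighbor degree centrality).

    adjacency: node -> set of adjacent nodes in the undirected tag network.
    neighbors: the tags a sample tag is connected to when it first appears.
    """
    neighbor_list = list(neighbors)
    g = len(adjacency)  # total number of nodes in the network
    if not neighbor_list or g < 2:
        return 0.0, 0.0
    # Degree of neighbor i: number of edges incident to it (sum of x_ij over j).
    degrees = [len(adjacency.get(node, set())) for node in neighbor_list]
    # Assumed centrality: degree / (g - 1), the standard normalized form.
    centralities = [d / (g - 1) for d in degrees]
    # Normalize by the number of neighbor nodes, as in step 4.3.
    count = len(neighbor_list)
    return sum(degrees) / count, sum(centralities) / count


# Hypothetical four-node network: a-b, a-c, c-d.
adjacency = {"a": {"b", "c"}, "b": {"a"}, "c": {"a", "d"}, "d": {"c"}}
mean_degree, mean_centrality = neighbor_features(adjacency, ["a", "c"])
```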
5) Extract the proposer attribute features of the sample tags, specifically:
5.1) For each sample tag t ∈ {U_PopularTag, U_UnPopularTag}, obtain the proposer's ID and the time of first appearance when the tag is first proposed;
5.2) Sort the posts in ascending chronological order and find the total number of questions and the total number of answers posted by this proposer ID before the tag's first appearance; these serve as the tag proposer attribute features;
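Step 5 reduces to counting a user's earlier activity. The sketch below assumes the activity log is a list of `(timestamp, user_id, kind)` records with `kind` equal to `"question"` or `"answer"`; this record layout, the name `proposer_features`, and the strict "before" comparison are illustrative assumptions.

```python
from typing import Iterable, Tuple


def proposer_features(
    activity: Iterable[Tuple[float, str, str]],
    proposer_id: str,
    first_time: float,
) -> Tuple[int, int]:
    """Count the proposer's questions and answers posted before the tag first appeared."""
    questions = 0
    answers = 0
    for timestamp, user_id, kind in activity:
        # Only the proposer's own posts strictly before the tag's first appearance count.
        if user_id != proposer_id or timestamp >= first_time:
            continue
        if kind == "question":
            questions += 1
        elif kind == "answer":
            answers += 1
    return questions, answers


# Hypothetical activity log; user "u1" proposes the tag at time 5.
activity = [
    (1, "u1", "question"),
    (2, "u1", "answer"),
    (3, "u2", "question"),
    (4, "u1", "answer"),
    (9, "u1", "question"),
]
n_questions, n_answers = proposer_features(activity, "u1", first_time=5)
```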
6) Extract the attribute change features of the sample tags, specifically: for the training sample tag set U = {U_PopularTag, U_UnPopularTag}, the total number of answers the tag receives within 5 days after it is proposed;
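The 5-day answer count in step 6 is a simple time-window filter. In this sketch, `answer_times` is assumed to hold the timestamps of all answers to posts carrying the tag; the function name and the inclusive window boundaries are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable


def answers_within_window(answer_times: Iterable[datetime], first_time: datetime, days: int = 5) -> int:
    """Count answers received from the tag's first appearance up to `days` days later."""
    window_end = first_time + timedelta(days=days)
    # Both ends of the window are treated as inclusive (an assumption).
    return sum(1 for ts in answer_times if first_time <= ts <= window_end)


# Hypothetical timestamps; the tag is first proposed on 2016-01-01.
proposed = datetime(2016, 1, 1)
answer_times = [
    datetime(2016, 1, 2),    # inside the window
    datetime(2016, 1, 6),    # exactly 5 days later, counted
    datetime(2016, 1, 7),    # too late
    datetime(2015, 12, 31),  # before the tag existed
]
n_answers_5d = answers_within_window(answer_times, proposed)
```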
7) Train the classification model with Logistic multiple regression: take the sample tag set U = {U_PopularTag, U_UnPopularTag} together with five features of each tag as input: the neighbor node degree, the neighbor node degree centrality, the number of questions asked by the tag proposer, the number of answers given by the tag proposer, and the number of answers received within a certain time after the tag is proposed. Train and build the tag classifier model with Logistic multiple regression and use it as the classifier.
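The training step can be illustrated with a small, dependency-free logistic regression fitted by batch gradient descent. This is a sketch, not the patent's implementation: the source does not state the optimizer, learning rate, or feature scaling, so `train_logistic`, its hyperparameters, and the toy feature values (assumed to be pre-scaled to [0, 1]) are all hypothetical. In practice a library implementation would normally be used instead.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(
    X: Sequence[Sequence[float]], y: Sequence[int], lr: float = 0.5, epochs: int = 1000
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return 1 (popular) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical scaled features per tag: neighbor degree, neighbor degree centrality,
# proposer questions, proposer answers, answers received after the tag was proposed.
X = [
    [0.9, 0.8, 0.7, 0.6, 0.9],
    [0.8, 0.7, 0.9, 0.8, 0.7],
    [0.2, 0.1, 0.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 0.0, 0.0],
]
y = [1, 1, 0, 0]  # 1 = popular tag, 0 = unpopular tag
w, b = train_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
```

Because each popular tag is paired with one unpopular tag in step 3, the two classes are roughly balanced, which is why a plain 0.5 decision threshold is a reasonable starting point in this sketch.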
The above describes an embodiment of the present invention for tag classification on Tex.Stackexchange.com, a sub-site of the question-and-answer website StackExchange. The correlation between tags is incorporated into the features by building a network; the feature set for tag classification is enriched by considering tag neighbor features, tag proposer features, and features of how the tag evolves over time. The trained model finally gives a judgment on whether a tag is popular, which provides guidance for building a website's tag recommendation system.
Claims (3)
1. A multi-feature network popular tag prediction method based on Logistic regression, characterized in that the method comprises the following steps:
S1: Data preprocessing: collect the website's information content and tag data and sort the content in ascending order of time. The first α% of posts are regarded as transient data from before the tag network stabilized, and this transient data is deleted. From the remaining data, the leading portion of a preset ratio is chosen as training data;
S2: Build the tag network: for the tags that appear in the same piece of content, form an edge between each pair; traverse all the information to obtain the weighted undirected tag network graph G_Tag, in which the weight of an edge is the number of times the two tags appear together;
S3: Sort the tags in descending order of the frequency with which they appear in posts, and take the top β% of tags as the popular tag set U_PopularTag;
S4: Find the unpopular tag set U_UnPopularTag: for each popular tag t ∈ U_PopularTag, find the time at which t first appears and, centered on this time, search for the tag whose first appearance is closest to it and which does not belong to U_PopularTag. This tag is taken as an unpopular tag, and these tags form the comparison set of unpopular tags U_UnPopularTag;
S5: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the network features of each tag: on the weighted undirected network G_Tag, extract the node degree and the node degree centrality of the neighbor nodes to which the sample tag is connected when it first appears;
S6: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the proposer attribute features of each tag, specifically the quantity of information content the tag's proposer had published at the time of proposing the tag and the length of the information content;
S7: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the attribute change features of each tag, specifically the number of answers received within 5 days after the tag is proposed by the posts carrying the tag;
S8: Using Logistic multiple regression with the features of the tags in the set U = {U_PopularTag, U_UnPopularTag} as training data, train and build the tag classifier model.
2. The multi-feature network popular tag prediction method based on Logistic regression according to claim 1, characterized in that: in step S1, α% is determined by taking the point at which a preset percentage of the website's total number of distinct tags has appeared as the cut-off point for α%.
3. The multi-feature network popular tag prediction method based on Logistic regression according to claim 1 or 2, characterized in that: in step S5, the node degree of neighbor i is calculated using formula (1),
where g denotes the total number of nodes in the network; x_ij = 1 if there is an edge between nodes i and j, and x_ij = 0 otherwise;
and the node degree centrality of neighbor i is calculated using formula (2).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610864860.9A CN106446191B (en) | 2016-09-30 | 2016-09-30 | A multi-feature network popular tag prediction method based on Logistic regression |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610864860.9A CN106446191B (en) | 2016-09-30 | 2016-09-30 | A multi-feature network popular tag prediction method based on Logistic regression |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106446191A true CN106446191A (en) | 2017-02-22 |
CN106446191B CN106446191B (en) | 2019-11-05 |
Family
ID=58169804
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610864860.9A Active CN106446191B (en) | 2016-09-30 | 2016-09-30 | A multi-feature network popular tag prediction method based on Logistic regression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106446191B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130018697A1 (en) * | 2011-07-15 | 2013-01-17 | Giovanni Giuffrida | System to forecast performance of online news articles to suggest the optimal homepage layout to maximize article readership and readers stickiness |
CN104572733A (en) * | 2013-10-22 | 2015-04-29 | 腾讯科技(深圳)有限公司 | User interest tag classification method and device |
CN103631874A (en) * | 2013-11-07 | 2014-03-12 | 微梦创科网络科技(中国)有限公司 | UGC label classification determining method and device for social platform |
CN103678670A (en) * | 2013-12-25 | 2014-03-26 | 福州大学 | Micro-blog hot word and hot topic mining system and method |
CN104281882A (en) * | 2014-09-16 | 2015-01-14 | 中国科学院信息工程研究所 | Method and system for predicting social network information popularity on basis of user characteristics |
CN104933622A (en) * | 2015-03-12 | 2015-09-23 | 中国科学院计算技术研究所 | Microblog popularity degree prediction method based on user and microblog theme and microblog popularity degree prediction system based on user and microblog theme |
Non-Patent Citations (1)
Title |
---|
刘列: ""社交网络用户标签预测研究"", 《中文信息学报》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106951471A (en) * | 2017-03-06 | 2017-07-14 | 浙江工业大学 | A kind of construction method of the label prediction of the development trend model based on SVM |
CN106951471B (en) * | 2017-03-06 | 2020-05-05 | 浙江工业大学 | SVM-based label development trend prediction model construction method |
CN108629358A (en) * | 2017-03-23 | 2018-10-09 | 北京嘀嘀无限科技发展有限公司 | The prediction technique and device of object type |
CN108629358B (en) * | 2017-03-23 | 2020-12-25 | 北京嘀嘀无限科技发展有限公司 | Object class prediction method and device |
CN110380954A (en) * | 2017-04-12 | 2019-10-25 | 腾讯科技(深圳)有限公司 | Data sharing method and device, storage medium and electronic device |
CN115002030A (en) * | 2022-04-27 | 2022-09-02 | 安徽工业大学 | Website fingerprint identification method and device, storage and processor |
Also Published As
Publication number | Publication date |
---|---|
CN106446191B (en) | 2019-11-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||