CN106446191A - Logistic regression based multi-feature network popular tag prediction method - Google Patents

Info

Publication number
CN106446191A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610864860.9A
Other languages
Chinese (zh)
Other versions
CN106446191B (en
Inventor
傅晨波
王金宝
陈风雷
郑永立
靳继伟
宣琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201610864860.9A priority Critical patent/CN106446191B/en
Publication of CN106446191A publication Critical patent/CN106446191A/en
Application granted granted Critical
Publication of CN106446191B publication Critical patent/CN106446191B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562: Bookmark management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a Logistic regression based multi-feature network popular tag prediction method. The method includes the following steps: (1) constructing a weighted undirected Tag network from question-and-answer website post data; (2) extracting a popular tag set and an unpopular tag set according to tag occurrence frequency; (3) extracting the network features of the tags, the attribute features of the tag proposers, and the attribute-change features after a tag is proposed as feature vectors; (4) training with Logistic multiple regression to construct a tag classification model. The relevance between tags is taken into consideration, the tags are classified according to multiple features, and potential popular tags are predicted with high precision. The method helps guide users to select reasonable tags and helps website designers provide higher-quality tags.

Description

A Logistic regression based multi-feature network popular tag prediction method
Technical field
The present invention relates to the fields of data mining and computer technology, and in particular to a Logistic regression based multi-feature network popular tag prediction method.
Background
A web tag (Tag) is a way of organizing Internet content. It usually consists of a few keywords closely related to the content; it helps people describe and categorize content easily, and also facilitates retrieving and sharing information. Because of this convenience, tag prediction and tag recommendation have been widely applied on many network platforms in recent years, such as the question-and-answer website StackExchange, the photo sharing website Flickr, and the restaurant review website Yelp. Using suitable tags is very important both for websites and for users. For a website, suitable tags help it make personalized recommendations to users, increasing user stickiness and the site's click-through rate; for a user, tags help locate what is needed quickly and avoid wasting time browsing useless information. When choosing tags, how to choose potentially popular tags is a crucial step, because popular tags often represent the needs of most users.
At present, tags for a piece of information are selected mainly on the basis of the textual relevance between the information and the tag and the attributes of the person who posted the information. Such selection has several drawbacks: 1. it ignores the potential popularity trend of a tag; 2. it ignores the correlation between tags; 3. unpopular content leads to unpopular tags, so that the information cannot be searched effectively; 4. only a few features are considered, so that tag selection tends to be one-sided.
Therefore, the aim is to let users choose tags better when publishing content, selecting potentially popular tags as far as possible. The Logistic regression based multi-feature network popular tag prediction method of the present invention mainly addresses the following two problems: (1) predicting the future popularity trend of a tag; (2) using a large number of features to characterize the popularity trend of a tag quantitatively.
Summary of the invention
In order to overcome the shortcomings of existing tag selection systems, which ignore the potential popularity trend of tags and the correlation between tags and which evaluate only a single feature, the invention provides a Logistic regression based multi-feature network popular tag prediction method that considers multiple features and the correlation between tags, and that can also predict the popularity trend of tags better.
The technical solution adopted by the present invention to solve the technical problem is as follows:
A Logistic regression based multi-feature network popular tag prediction method comprises the following steps:
S1: Data preprocessing: collect the information content and tag data of the website, and sort the website's information content in ascending time order; treat the first α% of posts as transient data from before the tag network stabilized, and delete this transient data; from the remaining website data, select the first preset proportion of the data as training data;
S2: Build the tag (Tag) network: for the Tags that appear in the same piece of information content, form an edge between every pair; traverse all the information to obtain a weighted undirected tag network graph G_Tag, in which the weight of an edge is the number of times the two tags appear together;
S3: Sort the tags in descending order of the frequency with which they appear in posts, and take the top β% of Tags as the popular tag set U_PopularTag;
S4: Find the unpopular tag set U_UnPopularTag: for each popular tag t ∈ U_PopularTag, find the time at which tag t first appears and, centered on this time, find the tag whose first appearance is nearest to this time and which does not belong to U_PopularTag; take it as an unpopular tag, forming the unpopular tag set U_UnPopularTag used for comparison;
S5: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the network features of its Tags: on the weighted undirected network G_Tag, extract the degree values and degree centralities of the neighbor nodes that a sample tag is connected to when it first appears;
S6: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the proposer attribute features of its Tags, specifically the number of information items the Tag proposer had published when proposing the Tag and the length of the information content;
S7: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the attribute-change features of its Tags, specifically the number of answers received by the posts carrying the Tag within 5 days after the Tag is proposed;
S8: Using Logistic multiple regression, with the features of the tags in the set U = {U_PopularTag, U_UnPopularTag} as training data, train and build a tag classifier model.
Further, in step S1, α% is determined as follows: the moment at which a preset percentage of the website's total number of tags has appeared is taken as the cutoff point of α%. The purpose is to ensure that the tag network is not affected by the tag debugging that staff carried out when the website was first set up;
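As a rough illustration of this cutoff rule, the sketch below scans posts in time order and reports the fraction of posts seen by the time a preset share of all distinct tags has appeared. The function name `transient_fraction`, the `(timestamp, tags)` post layout, and the 0.2 default are assumptions made for the example and are not taken from the patent text.

```python
import math

def transient_fraction(posts, tag_share=0.2):
    """Return alpha as a fraction of all posts (hypothetical helper).

    posts: iterable of (timestamp, list_of_tags) pairs (assumed layout).
    tag_share: preset share of distinct tags that marks the stable point.
    """
    ordered = sorted(posts, key=lambda p: p[0])        # ascending time order
    all_tags = {t for _, tags in ordered for t in tags}
    if not all_tags:
        return 0.0
    target = math.ceil(len(all_tags) * tag_share)      # tags needed for "stable"
    seen = set()
    for n_posts, (_, tags) in enumerate(ordered, start=1):
        seen.update(tags)
        if len(seen) >= target:
            # Posts traversed so far divided by the total number of posts.
            return n_posts / len(ordered)
    return 1.0
```

Posts that fall inside the returned fraction would then be dropped before the network is built.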
Further, in step S5, the node degree value of neighbor i is calculated using formula (1):
$$k_i = \sum_{j=1}^{g} x_{ij}, \quad i \neq j \tag{1}$$
where g is the total number of nodes in the network; x_{ij} = 1 if nodes i and j are connected by an edge, and x_{ij} = 0 otherwise;
The node degree centrality of neighbor i is calculated using formula (2):
$$C_D(i) = \frac{k_i}{g-1} \tag{2}$$
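A small sketch of the degree value and degree centrality computation follows, assuming the network is held as a dictionary that maps each node to the set of its neighbors. The function name and data layout are illustrative choices only.

```python
def degree_and_centrality(adjacency):
    """Compute (k_i, C_D(i)) for every node of an undirected graph.

    adjacency: dict mapping node -> set of neighbouring nodes (assumed layout).
    """
    g = len(adjacency)                            # total number of nodes
    result = {}
    for i, neighbours in adjacency.items():
        k_i = len(neighbours - {i})               # formula (1): count edges to j != i
        c_d = k_i / (g - 1) if g > 1 else 0.0     # formula (2): k_i / (g - 1)
        result[i] = (k_i, c_d)
    return result
```

For a three-node star with center `a`, the center has degree 2 and centrality 1.0, while each leaf has degree 1 and centrality 0.5.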
The beneficial effects of the present invention are: the correlation between tags is taken into account and tags are classified according to multiple features, giving higher precision in predicting potential popular tags. This helps guide users to select reasonable tags and also helps website builders provide higher-quality tags.
Brief description of the drawings
Fig. 1 is a flow chart of a Logistic regression based multi-feature tag classification method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of tag occurrence frequency according to an embodiment of the present invention.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Referring to Fig. 1 and Fig. 2, a Logistic regression based multi-feature network popular tag prediction method is described. The present invention uses data officially published by Tex.Stackexchange.com, a sub-site of the question-and-answer website StackExchange, to model and analyze the tag classification system. The raw data records information such as the time each post appeared, the poster ID, and the post's tags. Taking the tags (Tag) studied in this patent as an example, we extract information such as the time a tag first appears, the tag proposer ID, and the IDs of neighboring tags.
In this embodiment, the Logistic regression based multi-feature tag classification method has the following specific steps:
1) Build the tag (Tag) network: process the published post data as follows:
1.1) Traverse the post data to obtain the set of all Tags T_I, I ∈ N, where N is the total number of tags. Take N × 20% tags as the number of tags required for the website's tags to reach the stable point; the benefit is to prevent the noise that staff introduced into the posts while debugging the site content when the website was first set up;
1.2) Sort the posts in ascending time order and traverse the post data again; when the number of distinct tags obtained reaches N × 20%, record the number of posts traversed so far as N_InstablePosts, and regard the publication time of the current post as the website's tag stabilization time;
1.3) Determine α% = N_InstablePosts / N_Posts, where N_Posts is the total number of published posts;
1.4) Build the Tag network: remove the first α% of posts, then read the first 80% of the question-and-answer website data as training data. The Tag network is built as follows: for the Tags that appear in the same post, form an edge between every pair. Traverse all the information to obtain a weighted undirected tag network graph G_Tag, in which the weight of an edge is the number of times the two tags appear together;
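The network construction in step 1.4) can be sketched as a co-occurrence count over tag pairs. The representation below, a `Counter` keyed by sorted tag pairs, is an assumption chosen for brevity; the patent does not prescribe a data structure.

```python
from collections import Counter
from itertools import combinations

def build_tag_network(posts):
    """Build a weighted undirected tag network as an edge-weight table.

    posts: iterable of tag lists, one list per post (assumed layout).
    Returns a Counter mapping a sorted (tag_u, tag_v) pair to the number
    of posts in which both tags appear together.
    """
    weights = Counter()
    for tags in posts:
        # Sorting makes (u, v) and (v, u) the same key, so edges are undirected.
        for u, v in combinations(sorted(set(tags)), 2):
            weights[(u, v)] += 1
    return weights
```

A post with a single tag adds no edge, and a tag repeated within one post is counted once.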
2) Obtain the popular tag set U_PopularTag: process the published post data as follows:
2.1) Traverse the post data to obtain the frequency with which each Tag appears in posts;
2.2) Sort the Tags in descending order of frequency and take the top β% of Tags as the popular tag set U_PopularTag; here we choose β% = 5%;
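A minimal sketch of step 2) is given below. Rounding the cutoff up, keeping at least one tag, and breaking frequency ties alphabetically are assumptions of this example; the patent text does not specify them.

```python
import math
from collections import Counter

def popular_tags(posts, beta=0.05):
    """Return the top `beta` share of tags by post frequency.

    posts: iterable of tag lists, one list per post (assumed layout).
    """
    # Count each tag at most once per post.
    freq = Counter(t for tags in posts for t in set(tags))
    n_top = max(1, math.ceil(len(freq) * beta))      # assumed rounding rule
    ranked = sorted(freq, key=lambda t: (-freq[t], t))  # descending frequency
    return set(ranked[:n_top])
```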
3) Obtain the unpopular tag set U_UnPopularTag, with the following specific steps:
3.1) For each Tag, traverse the posts to obtain the time at which the tag first appears;
3.2) For each popular tag t ∈ U_PopularTag, search all remaining tags (the remaining tags are those not in the popular tag set) and compute the time difference with this tag, i.e. the difference ΔT between the first appearance time of each remaining tag and that of this tag;
3.3) Sort the time differences ΔT in ascending order and take the tag t' with the smallest ΔT as an unpopular tag, thereby forming the unpopular tag set U_UnPopularTag;
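The matching in step 3) could look like the sketch below. It assumes numeric timestamps and allows the same unpopular tag to be matched to more than one popular tag, since the patent text does not say whether matched tags are removed from the candidate pool. Both function names are hypothetical.

```python
def first_seen(posts):
    """Map each tag to its earliest timestamp.

    posts: iterable of (timestamp, list_of_tags) pairs (assumed layout).
    """
    first = {}
    for ts, tags in posts:
        for t in tags:
            if t not in first or ts < first[t]:
                first[t] = ts
    return first

def match_unpopular(first, popular):
    """For each popular tag, pick the non-popular tag with the smallest
    absolute difference in first-appearance time (ties broken by name)."""
    candidates = [t for t in first if t not in popular]
    matched = set()
    if not candidates:
        return matched
    for p in popular:
        nearest = min(candidates, key=lambda t: (abs(first[t] - first[p]), t))
        matched.add(nearest)
    return matched
```

Pairing each popular tag with a tag born at about the same time keeps the two classes comparable in age.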
4) Extract the network features of the Tags, with the following specific steps:
4.1) For each tag t ∈ {U_PopularTag, U_UnPopularTag}, calculate the node degree value of neighbor i using formula (1):
$$k_i = \sum_{j=1}^{g} x_{ij}, \quad i \neq j \tag{1}$$
where g is the total number of nodes in the network; x_{ij} = 1 if nodes i and j are connected by an edge, and x_{ij} = 0 otherwise;
4.2) Calculate the node degree centrality of neighbor i using formula (2):
$$C_D(i) = \frac{k_i}{g-1} \tag{2}$$
4.3) Normalize the neighbor node degree and the neighbor node degree centrality; the normalization denominator is the number of neighbor nodes;
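One possible reading of step 4.3) is that the degree values and degree centralities of a tag's neighbors are summed and divided by the number of neighbors, giving a mean per tag. The sketch below follows that interpretation, which is an assumption, and uses a symmetric node-to-neighbor-set dictionary as a hypothetical graph layout.

```python
def neighbour_features(adjacency, tag):
    """Return (mean neighbour degree, mean neighbour degree centrality).

    adjacency: dict mapping node -> set of neighbours, assumed symmetric.
    """
    g = len(adjacency)
    neighbours = adjacency.get(tag, set()) - {tag}
    if not neighbours or g < 2:
        return 0.0, 0.0
    # Degree of each neighbour, as in formula (1).
    degrees = [len(adjacency[n] - {n}) for n in neighbours]
    mean_degree = sum(degrees) / len(neighbours)
    # Degree centrality of each neighbour, as in formula (2), then averaged.
    mean_centrality = sum(k / (g - 1) for k in degrees) / len(neighbours)
    return mean_degree, mean_centrality
```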
5) Extract the proposer attribute features of the sample Tags, with the following specific steps:
5.1) For each sample tag t ∈ {U_PopularTag, U_UnPopularTag}, obtain the ID of the proposer at the time the tag is first proposed and the time at which the tag first appears;
5.2) Sort the posts in ascending time order and find the total number of questions and the total number of answers posted by this proposer ID before the tag first appears; these serve as the Tag proposer attribute features;
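A sketch of step 5) follows, assuming each post is a dictionary with `time`, `user`, `kind` (`"question"` or `"answer"`) and `tags` keys. These field names are invented for the example and do not reflect the actual StackExchange data schema.

```python
def proposer_features(posts, tag):
    """Return (proposer_id, first_time, n_questions, n_answers) for a tag,
    or None if the tag never appears.

    posts: list of dicts with keys 'time', 'user', 'kind', 'tags' (assumed).
    """
    ordered = sorted(posts, key=lambda p: p["time"])   # ascending time order
    first = next((p for p in ordered if tag in p["tags"]), None)
    if first is None:
        return None
    # Activity of the proposer strictly before the tag first appears.
    earlier = [p for p in ordered
               if p["user"] == first["user"] and p["time"] < first["time"]]
    n_questions = sum(p["kind"] == "question" for p in earlier)
    n_answers = sum(p["kind"] == "answer" for p in earlier)
    return first["user"], first["time"], n_questions, n_answers
```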
6) Extract the attribute-change feature of the sample Tags, with the following specific step: for the training sample tag set U = {U_PopularTag, U_UnPopularTag}, count the total number of answers the Tag receives within 5 days after it is proposed;
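The 5-day answer count can be sketched as a simple window filter. The `(answer_time, question_tags)` layout and the inclusive window bounds are assumptions of this example.

```python
from datetime import datetime, timedelta

def answers_in_window(tag, first_time, answers, days=5):
    """Count answers to posts carrying `tag` within `days` of its first use.

    first_time: datetime at which the tag first appears.
    answers: iterable of (answer_time, tags_of_answered_post) pairs, where
             answer_time is a datetime (assumed layout).
    """
    end = first_time + timedelta(days=days)
    return sum(1 for when, tags in answers
               if tag in tags and first_time <= when <= end)
```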
7) Train the classification model with Logistic multiple regression: take the sample tag set U = {U_PopularTag, U_UnPopularTag} described above, together with five features of each Tag as input, namely the neighbor node degree value, the neighbor node centrality, the number of questions asked by the Tag proposer, the number of answers given by the Tag proposer, and the number of answers received within a certain time after the Tag is proposed; train with Logistic multiple regression and build the tag classifier model, which serves as the classifier;
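To make the training step concrete, here is a dependency-free sketch of binary logistic regression fitted by batch gradient descent on five-dimensional feature vectors. The optimizer, learning rate, and epoch count are assumptions; the patent only states that Logistic multiple regression is used and does not describe how it is fitted.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss.

    X: list of feature vectors (e.g. the 5 features per tag).
    y: list of labels, 1 for popular and 0 for unpopular.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (popular) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0
```

In practice the features would usually be scaled to comparable ranges first, since raw counts such as answer totals can dominate the gradient.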
The above describes an embodiment of the present invention that classifies tags on Tex.Stackexchange.com, a sub-site of the question-and-answer website StackExchange. Building a network brings the correlation between tags into the features; considering tag neighbor features, tag proposer features, and features of how a tag changes over time increases the number of features available for tag classification. The trained model finally gives a judgment on whether a tag is popular, which provides guidance for building a website's tag recommendation system.

Claims (3)

1. A Logistic regression based multi-feature network popular tag prediction method, characterized in that the method comprises the following steps:
S1: Data preprocessing: collect the information content and tag data of the website, and sort the website's information content in ascending time order; treat the first α% of posts as transient data from before the tag network stabilized, and delete this transient data; from the remaining website data, select the first preset proportion of the data as training data;
S2: Build the tag (Tag) network: for the Tags that appear in the same piece of information content, form an edge between every pair; traverse all the information to obtain a weighted undirected tag network graph G_Tag, in which the weight of an edge is the number of times the two tags appear together;
S3: Sort the tags in descending order of the frequency with which they appear in posts, and take the top β% of Tags as the popular tag set U_PopularTag;
S4: Find the unpopular tag set U_UnPopularTag: for each popular tag t ∈ U_PopularTag, find the time at which tag t first appears and, centered on this time, find the tag whose first appearance is nearest to this time and which does not belong to U_PopularTag; take it as an unpopular tag, forming the unpopular tag set U_UnPopularTag used for comparison;
S5: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the network features of its Tags: on the weighted undirected network G_Tag, extract the degree values and degree centralities of the neighbor nodes that a sample tag is connected to when it first appears;
S6: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the proposer attribute features of its Tags, specifically the number of information items the Tag proposer had published when proposing the Tag and the length of the information content;
S7: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the attribute-change features of its Tags, specifically the number of answers received by the posts carrying the Tag within 5 days after the Tag is proposed;
S8: Using Logistic multiple regression, with the features of the tags in the set U = {U_PopularTag, U_UnPopularTag} as training data, train and build a tag classifier model.
2. The Logistic regression based multi-feature network popular tag prediction method according to claim 1, characterized in that: in step S1, α% is determined as follows: the moment at which a preset percentage of the website's total number of tags has appeared is taken as the cutoff point of α%.
3. The Logistic regression based multi-feature network popular tag prediction method according to claim 1 or 2, characterized in that: in step S5, the node degree value of neighbor i is calculated using formula (1):
$$k_i = \sum_{j=1}^{g} x_{ij}, \quad i \neq j \tag{1}$$
where g is the total number of nodes in the network; x_{ij} = 1 if nodes i and j are connected by an edge, and x_{ij} = 0 otherwise;
the node degree centrality of neighbor i is calculated using formula (2):
$$C_D(i) = \frac{k_i}{g-1} \tag{2}$$
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610864860.9A CN106446191B (en) 2016-09-30 2016-09-30 A kind of multiple features network flow row label prediction technique returned based on Logistic

Publications (2)

Publication Number Publication Date
CN106446191A true CN106446191A (en) 2017-02-22
CN106446191B CN106446191B (en) 2019-11-05

Family

ID=58169804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610864860.9A Active CN106446191B (en) 2016-09-30 2016-09-30 Logistic regression based multi-feature network popular tag prediction method

Country Status (1)

Country Link
CN (1) CN106446191B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130018697A1 (en) * 2011-07-15 2013-01-17 Giovanni Giuffrida System to forecast performance of online news articles to suggest the optimal homepage layout to maximize article readership and readers stickiness
CN103631874A (en) * 2013-11-07 2014-03-12 微梦创科网络科技(中国)有限公司 UGC label classification determining method and device for social platform
CN103678670A (en) * 2013-12-25 2014-03-26 福州大学 Micro-blog hot word and hot topic mining system and method
CN104281882A (en) * 2014-09-16 2015-01-14 中国科学院信息工程研究所 Method and system for predicting social network information popularity on basis of user characteristics
CN104572733A (en) * 2013-10-22 2015-04-29 腾讯科技(深圳)有限公司 User interest tag classification method and device
CN104933622A (en) * 2015-03-12 2015-09-23 中国科学院计算技术研究所 Microblog popularity degree prediction method based on user and microblog theme and microblog popularity degree prediction system based on user and microblog theme

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘列: ""社交网络用户标签预测研究"", 《中文信息学报》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951471A (en) * 2017-03-06 2017-07-14 浙江工业大学 A kind of construction method of the label prediction of the development trend model based on SVM
CN106951471B (en) * 2017-03-06 2020-05-05 浙江工业大学 SVM-based label development trend prediction model construction method
CN108629358A (en) * 2017-03-23 2018-10-09 北京嘀嘀无限科技发展有限公司 The prediction technique and device of object type
CN108629358B (en) * 2017-03-23 2020-12-25 北京嘀嘀无限科技发展有限公司 Object class prediction method and device
CN110380954A (en) * 2017-04-12 2019-10-25 腾讯科技(深圳)有限公司 Data sharing method and device, storage medium and electronic device
CN115002030A (en) * 2022-04-27 2022-09-02 安徽工业大学 Website fingerprint identification method and device, storage and processor

Also Published As

Publication number Publication date
CN106446191B (en) 2019-11-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant