CN106446191A

CN106446191A - Logistic regression based multi-feature network popular tag prediction method

Info

Publication number: CN106446191A
Application number: CN201610864860.9A
Authority: CN
Inventors: 傅晨波; 王金宝; 陈风雷; 郑永立; 靳继伟; 宣琦
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2016-09-30
Filing date: 2016-09-30
Publication date: 2017-02-22
Anticipated expiration: 2036-09-30
Also published as: CN106446191B

Abstract

A multi-feature method for predicting popular network tags based on logistic regression, comprising the following steps: 1) constructing a weighted undirected tag network from the post data of a question-and-answer website; 2) extracting a set of popular tags and a set of non-popular tags; 3) extracting the network features of each tag, the attribute features of the tag's proposer, and the attribute-change features after the tag is proposed; 4) training a tag classification model with multiple logistic regression. The invention takes the correlation between tags into account and classifies tags on the basis of multiple features, giving high accuracy when predicting potentially popular tags. This helps guide users toward reasonable tags and helps website builders provide higher-quality tags.

Description

A Multi-feature Network Popular Tag Prediction Method Based on Logistic Regression

Technical Field

The invention relates to the fields of data mining and computer technology, and in particular to a multi-feature method for predicting popular network tags based on logistic regression.

Background

A network tag is a way of organizing Internet content. It usually consists of a few keywords closely related to the content, helps people describe and classify content conveniently, and makes information easier to retrieve and share. Because tags are so convenient, tag prediction and tag recommendation have been widely applied on many online platforms in recent years, such as the question-and-answer website StackExchange, the photo-sharing website Flickr, and the restaurant review website Yelp. Choosing appropriate tags matters to both websites and users. For websites, appropriate tags support personalized recommendations, increasing user stickiness and click-through rate; for users, tags help locate what they need quickly and avoid wasting time browsing useless information. In tag selection, choosing potentially popular tags is a critical step, because popular tags often represent the needs of most users.

At present, tags are selected for a piece of information mainly according to the textual relevance between the information and the tag and the attributes of the person who posted it. This approach has several drawbacks: 1. it ignores a tag's potential to become popular; 2. it ignores the correlation between tags; 3. unpopular content leads to unpopular tags, so the information cannot be found effectively by search; 4. only a few features are considered, so the selection of some tags tends to be one-sided.

Therefore, so that users can choose tags better when publishing content, potentially popular tags should be selected wherever possible. The multi-feature method for predicting popular network tags based on logistic regression of the present invention addresses two basic problems: (1) it predicts the future popularity trend of a tag; (2) it uses a large number of features to characterize the tag's popularity trend quantitatively.

Summary of the Invention

To overcome the shortcomings of existing tag selection systems, which ignore the potential popularity trend of tags and the correlation between tags and rely on a single evaluation feature, the present invention provides a multi-feature method for predicting popular network tags based on logistic regression. It considers multiple features, including features that capture the correlation between tags, and better predicts the popularity trend of tags.

The technical solution adopted by the present invention to solve this technical problem is as follows:

A multi-feature method for predicting popular network tags based on logistic regression, comprising the following steps:

S1: Data preprocessing: collect the website's content and tag data and sort the content in ascending order of time; treat the first α% of posts as transient data from before the tag network stabilized and delete them; from the remaining data, take the first preset proportion as training data;

S2: Build the tag network: for the tags that appear in the same piece of content, connect every pair with an edge; traversing all content yields the weighted undirected tag network graph G_Tag, in which the weight of an edge is the number of times its two tags co-occur;

S3: Sort the tags in descending order of how often they appear in posts, and take the top β% of tags as the popular tag set U_PopularTag;

S4: Find the non-popular tag set U_UnPopularTag: for each popular tag t ∈ U_PopularTag, find the time at which t first appeared and, centered on that time, search for the tag whose first appearance is closest to it and that does not belong to U_PopularTag; these tags serve as non-popular tags and form the control set U_UnPopularTag;

S5: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the network features of each tag: on the weighted undirected network G_Tag, extract the degree and the degree centrality of the neighbor nodes to which the sample tag is connected when it first appears;

S6: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the attribute features of each tag's proposer, specifically the number of pieces of content the proposer had already published when proposing the tag, and the length of that content;

S7: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the attribute-change features of each tag, specifically the number of replies received within 5 days of the tag being proposed by the posts that carry the tag;

S8: Using multiple logistic regression, with the features of the tags in the set U = {U_PopularTag, U_UnPopularTag} as training data, train and build the tag classifier model.

Further, in step S1, α% is determined as follows: the point at which a preset percentage of the website's total number of tags has appeared is taken as the α% cutoff. The purpose is to ensure that the tag network is not affected by staff debugging the website's tags when the site was first set up;

Still further, in step S5, the degree of neighbor i is calculated using formula (1)

k_i = \sum_{j=1}^{g} x_{ij}, \quad (i \neq j) \qquad (1)

where g is the total number of nodes in the network; x_ij = 1 if nodes i and j are connected by an edge, and x_ij = 0 otherwise;

and the degree centrality of neighbor i is calculated using formula (2)

C_D(i) = \frac{k_i}{g - 1} \qquad (2)
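As a small illustration of formulas (1) and (2), the sketch below computes the degree and degree centrality of a node from an adjacency mapping. The tag names and the `adjacency` dictionary are made-up assumptions for illustration and are not data from the patent.

```python
# Hypothetical adjacency mapping: tag -> set of tags it shares an edge with.
adjacency = {
    "tikz": {"pgfplots", "beamer"},
    "pgfplots": {"tikz"},
    "beamer": {"tikz"},
    "fonts": set(),  # isolated node, still counted in g
}

def degree(adjacency, i):
    """Formula (1): k_i = number of nodes j != i with x_ij = 1."""
    return len(adjacency[i] - {i})

def degree_centrality(adjacency, i):
    """Formula (2): C_D(i) = k_i / (g - 1), with g the total node count."""
    g = len(adjacency)
    return degree(adjacency, i) / (g - 1) if g > 1 else 0.0

k_tikz = degree(adjacency, "tikz")
c_tikz = degree_centrality(adjacency, "tikz")
```

Dividing by g - 1 scales the degree by the largest value it could take, so the centrality always lies between 0 and 1.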

The beneficial effects of the present invention are as follows: it takes the correlation between tags into account and classifies tags on the basis of multiple features, giving high accuracy when predicting potentially popular tags. This helps guide users toward reasonable tags and helps website builders provide higher-quality tags.

Brief Description of the Drawings

FIG. 1 is a flow chart of a multi-feature tag classification method based on logistic regression according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of tag occurrence frequency according to an embodiment of the present invention.

Detailed Description

The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, in a multi-feature method for predicting popular network tags based on logistic regression, the present invention models and analyzes a tag classification system using the data officially published by Tex.Stackexchange.com, a sub-site of the question-and-answer website StackExchange. The raw data records, for each post, the time it appeared, the poster's ID, the post's tags, and other information. For each tag studied in this patent, we extract the time the tag first appeared, the ID of the tag's proposer, the IDs of its neighbor tags, and other information.

In this embodiment, the specific steps of the multi-feature tag classification method based on logistic regression are:

1) Build the tag network: process the published post data as follows:

1.1) Traverse the post data to obtain the set of all tags T_I, I ∈ N, where N is the total number of tags. Take N×20% as the number of tags required to reach the website's tag stabilization point; this prevents staff debugging of the website's content when the site was first set up from introducing noise into the model;

1.2) Sort the posts in ascending order of time and traverse the post data again; when the number of distinct tags seen reaches N×20%, record the number of posts traversed so far as N_InstablePosts, and treat the publication time of the current post as the website's tag stabilization time;

1.3) Determine α% = N_InstablePosts / N_Posts × 100%, where N_Posts is the total number of published posts;

1.4) Build the tag network: remove the first α% of posts, and read the first 80% of the posts in the remaining question-and-answer website data as training data. The tag network is built as follows: for the tags that appear in the same post, connect every pair with an edge. Traversing all posts yields the weighted undirected tag network graph G_Tag, in which the weight of an edge is the number of times its two tags co-occur;
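Steps 1.1) to 1.4) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the patent's implementation: posts are assumed to be `(timestamp, tags)` tuples, and the function names, the toy posts, and the choice to store edge weights in a `Counter` keyed by sorted tag pairs are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def stable_cutoff(posts, stable_fraction=0.20):
    """Number of leading posts (N_InstablePosts) seen before the count of
    distinct tags reaches stable_fraction * N (steps 1.1 and 1.2)."""
    posts = sorted(posts, key=lambda p: p[0])
    n_tags = len({t for _, tags in posts for t in tags})
    target = stable_fraction * n_tags
    seen = set()
    for idx, (_, tags) in enumerate(posts, start=1):
        seen.update(tags)
        if len(seen) >= target:
            return idx
    return len(posts)

def build_tag_network(posts):
    """Weighted undirected network: weight = number of posts in which
    both tags of a pair co-occur (step 1.4)."""
    weights = Counter()
    for _, tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return weights

def preprocess(posts, stable_fraction=0.20, train_fraction=0.80):
    """Drop the transient posts, then keep the first train_fraction of the rest."""
    posts = sorted(posts, key=lambda p: p[0])
    n_unstable = stable_cutoff(posts, stable_fraction)
    remaining = posts[n_unstable:]
    train = remaining[: int(len(remaining) * train_fraction)]
    return n_unstable / len(posts), train  # (alpha as a fraction, training posts)

# Toy data: (timestamp, tags)
posts = [(1, ["a", "b"]), (2, ["a", "c"]), (3, ["b", "c"]),
         (4, ["c", "d"]), (5, ["d", "e"]), (6, ["a", "b"])]
alpha, train = preprocess(posts)
network = build_tag_network(train)
```

Sorting each tag pair before counting means an edge is stored once regardless of the order in which the tags appear in a post, which matches the undirected network described above.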

2) Obtain the popular tag set U_PopularTag: process the published post data as follows:

2.1) Traverse the post data to obtain the frequency with which each tag appears in posts;

2.2) Sort the tags in descending order of frequency and take the top β% of tags as the popular tag set U_PopularTag; here we choose β% = 5%;

3) Obtain the non-popular tag set U_UnPopularTag; the specific steps are:

3.1) For each tag, traverse the posts to obtain the time at which the tag first appeared;

3.2) For each popular tag t ∈ U_PopularTag, compute for every other tag (that is, every tag not in the popular set) the difference ΔT between its first appearance time and that of t;

3.3) Sort the time differences ΔT in ascending order and take the tag t' with the smallest ΔT as a non-popular tag, thereby forming the non-popular tag set U_UnPopularTag;
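The selection of popular tags in step 2) and of matched non-popular tags in step 3) might look like the sketch below. It assumes the same `(timestamp, tags)` post layout as a convenience; the rounding of β% (ceiling, at least one tag), the alphabetical tie-breaking, and allowing one non-popular tag to be matched to several popular tags are assumptions, since the text does not specify them.

```python
import math
from collections import Counter

def first_seen_and_counts(posts):
    """Return ({tag: first appearance time}, {tag: number of posts containing it})."""
    first_seen, counts = {}, Counter()
    for ts, tags in sorted(posts, key=lambda p: p[0]):
        for t in set(tags):
            counts[t] += 1
            first_seen.setdefault(t, ts)
    return first_seen, counts

def select_popular(counts, beta=0.05):
    """Top beta fraction of tags by frequency (step 2.2)."""
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    k = max(1, math.ceil(beta * len(ranked)))
    return set(ranked[:k])

def select_unpopular(popular, first_seen):
    """For each popular tag, pick the non-popular tag whose first
    appearance time is closest (steps 3.2 and 3.3)."""
    candidates = [t for t in first_seen if t not in popular]
    if not candidates:
        return set()
    controls = set()
    for p in popular:
        nearest = min(candidates,
                      key=lambda t: (abs(first_seen[t] - first_seen[p]), t))
        controls.add(nearest)
    return controls

posts = [(1, ["a", "b"]), (2, ["a", "c"]), (3, ["a", "d"]), (4, ["b", "e"])]
first_seen, counts = first_seen_and_counts(posts)
popular = select_popular(counts, beta=0.2)   # beta raised only for this tiny example
unpopular = select_unpopular(popular, first_seen)
```

Matching each popular tag with a control that first appeared at nearly the same time keeps the two classes comparable in age, so the classifier is not simply learning that older tags have had longer to accumulate posts.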

4) Extract the network features of the tags; the specific steps are:

4.1) For each tag t ∈ {U_PopularTag, U_UnPopularTag}, calculate the degree of neighbor i using formula (1)

k_i = \sum_{j=1}^{g} x_{ij}, \quad (i \neq j) \qquad (1)

4.2) Calculate the degree centrality of neighbor i using formula (2)

C_D(i) = \frac{k_i}{g - 1} \qquad (2)

4.3) Normalize the neighbor node degree and the neighbor node degree centrality, using the number of neighbor nodes as the normalization denominator;
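One possible reading of step 4) is sketched below: the neighbors of a sample tag are taken to be the other tags in the post where it first appears, and the normalization in 4.3) is read as dividing the summed values by the number of neighbors. Both readings, the function name, and the toy adjacency data are assumptions made for illustration.

```python
def neighbor_features(tag, first_post_tags, adjacency):
    """Mean degree and mean degree centrality of the tags that `tag` is
    linked to in the post where it first appears.

    adjacency: {tag: set of neighbor tags} for the whole network G_Tag.
    """
    g = len(adjacency)
    neighbors = [t for t in set(first_post_tags) if t != tag]
    if not neighbors or g < 2:
        return 0.0, 0.0
    degrees = [len(adjacency[n]) for n in neighbors]           # formula (1)
    centralities = [k / (g - 1) for k in degrees]              # formula (2)
    return sum(degrees) / len(neighbors), sum(centralities) / len(neighbors)

# Hypothetical network with five tags
adjacency = {
    "tikz": {"pgfplots", "beamer", "arrows"},
    "pgfplots": {"tikz", "arrows"},
    "beamer": {"tikz"},
    "arrows": {"tikz", "pgfplots"},
    "fonts": set(),
}
mean_degree, mean_centrality = neighbor_features(
    "arrows", ["arrows", "tikz", "pgfplots"], adjacency)
```

Averaging over the neighbors keeps the feature comparable between a tag introduced alongside one other tag and a tag introduced alongside several.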

5) Extract the attribute features of each sample tag's proposer; the specific steps are:

5.1) For each sample tag t ∈ {U_PopularTag, U_UnPopularTag}, obtain the proposer's ID and the time of first appearance for the post in which the tag was first proposed;

5.2) Sort the posts in ascending order of time and find the total number of questions and the total number of answers posted under the proposer's ID before the tag's first appearance; these are the proposer attribute features of the tag;

6) Extract the attribute-change features of each sample tag; specifically, for the training sample tag set U = {U_PopularTag, U_UnPopularTag}, count the total number of answers the tag receives within 5 days of being proposed;
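Steps 5) and 6) could be computed along the lines of the sketch below. The record layout (dictionaries with `id`, `user`, `time`, `tags`, and `parent` keys) is a hypothetical simplification rather than the actual StackExchange data schema, and counting answers to every question that carries the tag, not only the first one, is an assumption.

```python
from datetime import datetime, timedelta

def proposer_features(tag, questions, answers):
    """(questions asked, answers written) by the tag's proposer before the
    tag first appeared (step 5.2)."""
    first = min((q for q in questions if tag in q["tags"]), key=lambda q: q["time"])
    user, t0 = first["user"], first["time"]
    n_questions = sum(1 for q in questions if q["user"] == user and q["time"] < t0)
    n_answers = sum(1 for a in answers if a["user"] == user and a["time"] < t0)
    return n_questions, n_answers

def answers_within_window(tag, questions, answers, days=5):
    """Answers received by questions carrying `tag` within `days` days of
    the tag's first appearance (step 6)."""
    tagged = [q for q in questions if tag in q["tags"]]
    t0 = min(q["time"] for q in tagged)
    deadline = t0 + timedelta(days=days)
    tagged_ids = {q["id"] for q in tagged}
    return sum(1 for a in answers
               if a["parent"] in tagged_ids and t0 <= a["time"] <= deadline)

questions = [
    {"id": 0, "user": "u2", "time": datetime(2015, 12, 30), "tags": ["fonts"]},
    {"id": 1, "user": "u1", "time": datetime(2016, 1, 1), "tags": ["tikz"]},
    {"id": 2, "user": "u1", "time": datetime(2016, 1, 3), "tags": ["tikz", "arrows"]},
    {"id": 3, "user": "u2", "time": datetime(2016, 1, 4), "tags": ["arrows"]},
]
answers = [
    {"parent": 0, "user": "u1", "time": datetime(2015, 12, 31)},
    {"parent": 1, "user": "u2", "time": datetime(2016, 1, 2)},
    {"parent": 2, "user": "u2", "time": datetime(2016, 1, 5)},
    {"parent": 3, "user": "u1", "time": datetime(2016, 1, 6)},
    {"parent": 2, "user": "u3", "time": datetime(2016, 1, 20)},
]
asked, answered = proposer_features("arrows", questions, answers)
early_answers = answers_within_window("arrows", questions, answers)
```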

7) Train the classification model with multiple logistic regression: take the sample tag set U = {U_PopularTag, U_UnPopularTag} and, for each tag, the following 5 features as input: the neighbor node degree, the neighbor node degree centrality, the number of questions asked by the tag's proposer, the number of answers given by the tag's proposer, and the number of answers received within a fixed period after the tag was proposed. Using multiple logistic regression as the classifier, train and build the tag classifier model;
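To make step 7) concrete, the sketch below fits a logistic regression on the five features using plain batch gradient descent. The patent does not say which solver or library is used, so the optimizer, learning rate, and epoch count are assumptions, and the feature rows are made-up values scaled to [0, 1], not results from the patent.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_popular(w, b, x, threshold=0.5):
    """True if the predicted probability of being popular reaches the threshold."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold

# Columns: neighbor degree, neighbor degree centrality, proposer questions,
# proposer answers, answers within 5 days (all hypothetical, pre-scaled).
X = [[0.9, 0.8, 0.5, 0.7, 0.9], [0.8, 0.7, 0.6, 0.9, 0.8], [0.7, 0.9, 0.4, 0.6, 0.7],
     [0.2, 0.1, 0.3, 0.1, 0.1], [0.1, 0.2, 0.2, 0.2, 0.0], [0.3, 0.2, 0.1, 0.0, 0.2]]
y = [1, 1, 1, 0, 0, 0]  # 1 = popular, 0 = non-popular
w, b = train_logistic(X, y)
```

In practice an established implementation would normally replace the hand-written loop; it is spelled out here only to show what the classifier is fitting.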

The above describes an embodiment of the present invention that classifies tags on Tex.Stackexchange.com, a sub-site of the question-and-answer website StackExchange. Building a network incorporates the correlation between tags into the features, and considering tag neighbor features, tag proposer features, and the evolution of tags over time enlarges the feature data available for tag classification. The trained model ultimately determines whether a tag is popular, which provides guidance for building a website's tag recommendation system.

Claims

1. A multi-feature method for predicting popular network tags based on logistic regression, characterized in that the method comprises the following steps:

S1: Data preprocessing: collect the website's content and tag data and sort the content in ascending order of time; treat the first α% of posts as transient data from before the tag network stabilized and delete them; from the remaining data, take the first preset proportion as training data;

S2: Build the tag network: for the tags that appear in the same piece of content, connect every pair with an edge; traversing all content yields the weighted undirected tag network graph G_Tag, in which the weight of an edge is the number of times its two tags co-occur;

S3: Sort the tags in descending order of how often they appear in posts, and take the top β% of tags as the popular tag set U_PopularTag;

S4: Find the non-popular tag set U_UnPopularTag: for each popular tag t ∈ U_PopularTag, find the time at which t first appeared and, centered on that time, search for the tag whose first appearance is closest to it and that does not belong to U_PopularTag; these tags serve as non-popular tags and form the control set U_UnPopularTag;

S5: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the network features of each tag: on the weighted undirected network G_Tag, extract the degree and the degree centrality of the neighbor nodes to which the sample tag is connected when it first appears;

S6: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the attribute features of each tag's proposer, specifically the number of pieces of content the proposer had already published when proposing the tag, and the length of that content;

S7: For the training sample tag set U = {U_PopularTag, U_UnPopularTag}, extract the attribute-change features of each tag, specifically the number of replies received within 5 days of the tag being proposed by the posts that carry the tag;

S8: Using multiple logistic regression, with the features of the tags in the set U = {U_PopularTag, U_UnPopularTag} as training data, train and build the tag classifier model.

2. The multi-feature method for predicting popular network tags based on logistic regression according to claim 1, characterized in that: in step S1, α% is determined by taking the point at which a preset percentage of the website's total number of tags has appeared as the α% cutoff.

3. The multi-feature method for predicting popular network tags based on logistic regression according to claim 1 or 2, characterized in that: in step S5, the degree of neighbor i is calculated using formula (1)

k_i = \sum_{j=1}^{g} x_{ij}, \quad (i \neq j) \qquad (1)

where g is the total number of nodes in the network; x_ij = 1 if nodes i and j are connected by an edge, and x_ij = 0 otherwise;

and the degree centrality of neighbor i is calculated using formula (2)

C_D(i) = \frac{k_i}{g - 1} \qquad (2).