CN106951471B

CN106951471B - SVM-based label development trend prediction model construction method

Info

Publication number: CN106951471B
Application number: CN201710127478.4A
Authority: CN
Inventors: 傅晨波; 郑永立; 李诗迪; 宣琦
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2017-03-06
Filing date: 2017-03-06
Publication date: 2020-05-05
Anticipated expiration: 2037-03-06
Also published as: CN106951471A

Abstract

A method for constructing an SVM (support vector machine)-based tag development trend prediction model comprises the following steps: (1) preprocessing the data set: collecting the post data of a website and removing irrelevant data; (2) selecting sample tags: counting how often each tag is used in the two years after it first appears, and extracting a popular tag set and a non-popular tag set; (3) constructing a weighted undirected network of tags; (4) extracting tag feature data, including the network features and related attribute features of each tag, as training and test data; (5) training on the data with a support vector machine (SVM) to construct a tag popularity trend prediction model. The method takes the correlations among tags into account, classifies the future development trend of tags by combining attribute features with network features, and predicts potentially popular tags with higher accuracy. The method helps guide users toward reasonable tags and helps website builders provide higher-quality tags.

Description

SVM-based label development trend prediction model construction method

Technical Field

The invention relates to data mining and data analysis technologies, in particular to a construction method of a label development trend prediction model based on an SVM (support vector machine).

Background

With the rapid development of the Internet, more and more people exchange information online. The resulting flood of information makes it hard for users to screen content quickly and efficiently, and network tags emerged to address this problem. A tag consists of keywords closely related to the content; it helps people describe and classify content and makes information easier to retrieve and share.

Meanwhile, predicting the development trend and classification of tags is attracting growing attention. The popularity trend of a new tag after it is proposed often reflects the popularity of a hot topic or direction in the field, which is of great concern to website communities. For a website, effective trend prediction and tag recommendation for new tags can promote the development of topics or emerging fields. For a user, searching content by tag popularity makes it easier to see where the field is currently heading.

At present, tags are chosen for a piece of information mainly according to how well the tag text matches the content, the attributes of the author, and the like. This approach has several shortcomings: (1) the potential popularity trend of new tags is neglected; (2) correlations between tags are ignored; (3) unpopular content leads to unpopular tags, so the information cannot be searched effectively; (4) only a few features are considered, so tag selection tends to be one-sided.

Therefore, to help users choose better tags when publishing information, tags with potential popularity should be selected wherever possible. The invention provides a method for constructing an SVM (support vector machine)-based tag development trend prediction model that addresses two basic problems: (1) extracting network features and related attribute features in the early stage of a tag's formation to quantitatively characterize its development trend; (2) predicting the future development trend of a new tag.

Disclosure of Invention

To improve a website's management of community tags and its prediction of the development trend of new tags, and to overcome the shortcomings of current tag popularity prediction, the invention provides a method for constructing an SVM (support vector machine)-based tag development trend prediction model. The method combines the network features among tags with attribute features extracted in a tag's early stage for training and prediction.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A method for constructing an SVM (support vector machine)-based tag development trend prediction model comprises the following steps:

Step 1: data preprocessing. Collect the information content of a website community and the corresponding tag data, sort the data by time, and use the data from N days after the community is established, so that the community's tag network has preliminarily formed;

Step 2: sample tag selection. Compute the frequency of each community tag over the data set and sort the tags by frequency; take the top α% as popular tags and denote the set U_pop; from the remaining tags, select tags whose creation times are comparable to those of the popular tags as non-popular tags;

Step 3: tag network construction. Tags that appear in the same piece of information content are considered related, so an edge is formed between every pair of them. Traverse all information in the website community to obtain a weighted undirected tag network G_Tag, in which the nodes are tags, the edges are relationships between tags, and the weight of an edge is the number of times its two tags appear together;

Step 4: feature extraction. For the tags in the sample tag set U = {U_pop, U_unpop}, extract the network features and attribute features over the M days after each tag is first created, and build the sample training data set;

Step 5: model training. Use a support vector machine (SVM) as the machine learning classifier, select a kernel function, train an SVM-based popular tag prediction model, and obtain the test accuracy by ten-fold cross-validation.

Further, in step 1, the data after N days are selected as the preprocessed data, where N is chosen so that the first 10% of the website's tags have already been created within N days, i.e. the website's tag network has preliminarily formed.
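The rule for choosing N can be illustrated with a minimal Python sketch. It assumes each tag's first appearance is known as a day offset from the site's launch; the function names `choose_n_days` and `posts_after`, the `day` field, and the toy data are hypothetical and not part of the patent.

```python
def choose_n_days(first_seen_day, percent=10):
    """Smallest day offset N by which `percent`% of all tags have been created.

    first_seen_day: dict mapping tag -> day offset of its first appearance
    (hypothetical input format).
    """
    days = sorted(first_seen_day.values())
    if not days:
        raise ValueError("no tags supplied")
    # Integer ceiling of percent% of the tag count, at least one tag.
    k = max(1, -(-percent * len(days) // 100))
    return days[k - 1]


def posts_after(posts, n_days):
    """Keep only posts created on or after day N (the preprocessed data)."""
    return [p for p in posts if p["day"] >= n_days]


# Toy data: ten tags and the day on which each first appeared.
first_seen_day = {"java": 2, "c#": 2, "python": 5, "html": 10, "css": 12,
                  "sql": 20, "php": 25, "linux": 40, "git": 90, "rust": 400}
n_days = choose_n_days(first_seen_day)              # 10% of 10 tags = 1 tag
posts = [{"id": 1, "day": 1}, {"id": 2, "day": 2}, {"id": 3, "day": 50}]
kept = posts_after(posts, n_days)
```

Integer arithmetic is used for the percentage so that the cut-off does not depend on floating-point rounding.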

Further, in step 2, the sample tag data are selected as follows. The tags are sorted in descending order of frequency. The top α% of tags in the sorted set are taken as popular tags, and their set is denoted U_pop. The bottom β% of tags are taken as the candidate non-popular tag set, denoted Q_unpop. For each popular tag t_pop ∈ U_pop, the tag t_unpop ∈ Q_unpop whose creation time is closest to that of t_pop is selected as a non-popular tag, serving as a time-matched contrast to the popular tag; the set of these tags is denoted U_unpop;
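The time-matched sampling can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `select_samples`, the input dictionaries, and the choice to match without replacement (each non-popular tag is used at most once) are assumptions.

```python
def select_samples(freq, created, alpha=5, beta=85):
    """Return (U_pop, U_unpop) as lists of tags.

    freq:    dict tag -> usage frequency
    created: dict tag -> creation time (any comparable number, e.g. a day offset)
    alpha:   percentage of top-ranked tags treated as popular
    beta:    percentage of bottom-ranked tags forming the candidate set Q_unpop
    """
    ranked = sorted(freq, key=lambda t: (-freq[t], t))   # descending frequency
    n = len(ranked)
    n_pop = max(1, n * alpha // 100)
    n_unpop = max(1, n * beta // 100)
    u_pop = ranked[:n_pop]
    q_unpop = ranked[n - n_unpop:]

    u_unpop = []
    for t_pop in u_pop:
        # Assumption: each non-popular tag may be matched only once.
        candidates = [t for t in q_unpop if t not in u_unpop and t not in u_pop]
        if not candidates:
            break
        # Pick the candidate whose creation time is closest to that of t_pop.
        match = min(candidates, key=lambda t: (abs(created[t] - created[t_pop]), t))
        u_unpop.append(match)
    return u_pop, u_unpop


# Toy data: 20 tags, frequency decreasing with index, scattered creation days.
tags = [f"t{i:02d}" for i in range(20)]
freq = {t: 1000 - 10 * i for i, t in enumerate(tags)}
created = {t: (i * 7) % 20 for i, t in enumerate(tags)}
u_pop, u_unpop = select_samples(freq, created, alpha=10, beta=50)
```

Matching on creation time keeps the two classes comparable in age, so that a tag is not labelled non-popular merely because it has had less time to accumulate usage.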

Further, in step 4, the network features of the tags are extracted with M = 30. The network features mainly include:

1) Relative degree centrality within 30 days after the new tag is proposed. The degree D_i of tag t_i is calculated with isolated nodes removed, using the formula:

D_i = Σ_{j=1}^{N} a_ij, (1)

where N is the total number of tags in the network and a_ij is an element of the network adjacency matrix: a_ij = 1 if tags t_i and t_j are connected by an edge, and a_ij = 0 otherwise;

The degree centrality feature of tag t_i is taken as its relative degree centrality in the network:

DC_i = D_i / (N − 1), (2)

where D_i is the degree of tag t_i;

2) Neighbor mean degree within 30 days after the new tag is proposed. The neighbor mean degree NC_i of tag t_i is calculated as follows:

NC_i = (1 / N_neighbor) Σ_{j∈Γ(i)} D_j, (3)

where N_neighbor is the number of neighbor nodes of tag t_i, Γ(i) is the set of those neighbors, and Σ_{j∈Γ(i)} D_j is the sum of the degrees of the neighbor nodes of tag t_i;

3) Relative closeness centrality within 30 days after the new tag is proposed. The closeness centrality of tag t_i is likewise taken as its relative closeness centrality:

CC_i = 1 / d̄_i, with d̄_i = (1 / (N − 1)) Σ_{j≠i} d_ij, (4)

where d_ij is the distance between tag t_i and tag t_j, and d̄_i is the average geodesic distance from tag t_i to the other tag nodes;

4) Eigenvector centrality within 30 days after the new tag is proposed. The eigenvector centrality x_i of tag t_i is calculated as follows:

x_i = η Σ_{j=1}^{N} a_ij w_ij x_j, (5)

where η is a proportionality constant and A = (a_ij w_ij) is the weighted network adjacency matrix, in which w_ij is the weight of the edge between tags t_i and t_j, with w_ij = w_ji. Letting x = [x_1 x_2 … x_N]^T, equation (5) can be written in matrix form as follows:

x = ηAx, (6)

so x is the eigenvector of matrix A corresponding to its largest-modulus eigenvalue η^-1, and is called the eigenvector centrality;

5) Node clustering coefficient within 30 days after the new tag is proposed. The clustering coefficient C_i of tag t_i is calculated as follows:

C_i = 2E_i / (k_i (k_i − 1)), (7)

where E_i is the number of edges that actually exist among the k_i neighbor tag nodes of tag t_i, and k_i(k_i − 1)/2 is the maximum number of edges that could exist among those k_i neighbor nodes.
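As an illustration of features 1), 2), 3) and 5), the sketch below computes them in plain Python on a small undirected graph. Several points are assumptions rather than statements of the patent: the adjacency is given as a symmetric dict of neighbor sets, distances are unweighted hop counts found by breadth-first search, and closeness is averaged over the nodes reachable from each tag. The name `network_features` is hypothetical.

```python
from collections import deque


def network_features(adj):
    """adj: dict tag -> set of neighbouring tags (symmetric, undirected)."""
    adj = {t: nb for t, nb in adj.items() if nb}      # remove isolated nodes
    n = len(adj)
    degree = {t: len(nb) for t, nb in adj.items()}    # D_i
    feats = {}
    for t, nb in adj.items():
        # Breadth-first search gives hop distances d_ij from tag t.
        dist = {t: 0}
        queue = deque([t])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        reachable = len(dist) - 1
        k = degree[t]
        # E_i: edges that actually exist among the neighbours of t.
        links = sum(1 for u in nb for v in nb if u < v and v in adj[u])
        feats[t] = {
            "degree": k,
            "relative_degree": k / (n - 1) if n > 1 else 0.0,
            "neighbor_mean_degree": sum(degree[u] for u in nb) / k,
            "closeness": reachable / total if total else 0.0,   # 1 / mean distance
            "clustering": 2 * links / (k * (k - 1)) if k > 1 else 0.0,
        }
    return feats


# Toy graph: triangle a-b-c, pendant node d attached to c, isolated node e.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c"}, "e": set()}
feats = network_features(adj)
```

In the toy graph, tag c is adjacent to every other non-isolated tag, so its relative degree and closeness are both 1, while only one of the three possible edges among its neighbors exists.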

In step 4, the attribute features extracted include:

4.1) all answers to the questions containing the new tag within 30 days after the tag is proposed;

4.2) for all questioners and respondents who participate in the tag within 30 days, their average prior number of answers, average prior number of questions, and average prior elapsed time;

4.3) the average answer response time of the tag's questions within 30 days;

4.4) the number of all users participating in the tag within 30 days, i.e. the sum of the questioners and the respondents of its questions;

4.5) the average word count of all questions containing the tag within 30 days;

4.6) the total number of upvotes of all questions containing the tag within 30 days;

The average answer response time of a tag's questions is calculated as follows:

The quantities used are: the number of questions containing tag t_i within the 30 days; the number of answers to the s-th question of tag t_i within the 30 days; the creation time of the s-th question of tag t_i; and the creation time of the v-th answer to that question.

The response time difference between each answer and its question is calculated, and the differences are averaged over all questions and answers.
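A minimal sketch of the response-time feature is given below. The formula itself is not reproduced in this text, so the sketch assumes one reading of the description: the mean is pooled over all question-answer pairs. The function name and the input layout are hypothetical.

```python
def average_response_time(questions):
    """Mean (answer time - question time) over all answers of all questions.

    questions: list of (question_time, [answer_times]) pairs, with all times in
    the same unit (hypothetical input format). Returns None if there are no
    answers at all.
    """
    diffs = [a - q for q, answers in questions for a in answers]
    return sum(diffs) / len(diffs) if diffs else None


# Toy data (hours): the third question has no answers and contributes nothing.
questions = [(0, [2, 6]), (10, [11]), (20, [])]
avg_hours = average_response_time(questions)   # (2 + 6 + 1) / 3
```

If the intended formula instead averages per question first and then across questions, only the aggregation line would change.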

In step 5, the two-class support vector machine (SVM) model is constructed as follows:

First, the Gaussian kernel (RBF) is chosen as the kernel function, i.e. the inner product of samples t_i and t_j in the feature space is computed in the original sample space by the function k(t_i, t_j), whose expression is as follows:

k(t_i, t_j) = exp(−‖t_i − t_j‖² / (2δ²)),

where δ is the bandwidth of the Gaussian kernel.
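For illustration, the Gaussian kernel can be written in a few lines of Python. The sketch assumes the common form exp(−‖x − y‖² / (2δ²)) with bandwidth δ and represents each sample as a plain list of feature values; the function names are hypothetical.

```python
import math


def rbf_kernel(x, y, delta=1.0):
    """Gaussian (RBF) kernel with bandwidth delta (assumed form, see lead-in)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * delta ** 2))


def kernel_matrix(samples, delta=1.0):
    """Symmetric matrix of pairwise kernel values for a list of samples."""
    return [[rbf_kernel(a, b, delta) for b in samples] for a in samples]


samples = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
K = kernel_matrix(samples, delta=1.0)
```

Libraries that parameterise the RBF kernel as exp(−γ‖x − y‖²) correspond to γ = 1 / (2δ²).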

Then the optimal parameter values of the SVM model are found by grid search, ten-fold cross-validation is performed, and the tests are repeated several times and averaged to obtain the accuracy of the SVM-based tag popularity trend prediction model.

The beneficial effects of the invention are as follows: compared with the prior art, the SVM-based tag development trend prediction model can predict the development trend of newly appearing tags, which addresses the neglect of newly appearing, little-used tags in tag recommendation and makes tag recommendation more reasonable and effective.

Drawings

FIG. 1 is a program flow chart of the invention;

FIG. 2 shows the construction process of the SVM-based tag trend prediction model of the invention.

Detailed Description

The following detailed description of embodiments of the invention is provided in connection with the accompanying drawings.

Referring to FIGS. 1 and 2, the invention provides a method for constructing an SVM (support vector machine)-based tag development trend prediction model. The method is illustrated with a case analysis on a Stack Overflow data set, whose raw data include the creation time of each post, the post ID, the user ID, the post tags, and so on. Taking the tags of questions as an example, we extract each tag's first creation time, the ID of the user who proposed it, information about its neighbor tags, and so on.

The invention is divided into the following five steps:

Step 1: screening and preprocessing the data set;

Step 2: selecting the sample tag data;

Step 3: constructing the tag network;

Step 4: extracting the feature data of the sample tags;

Step 5: constructing and training the SVM-based tag popularity trend prediction model.

In step 1, the specific operation is as follows: select the website's information content and the corresponding tag data starting 3 months after the website was established, by which time the website's tag network has preliminarily formed; then count the frequency of the newly appearing tags and sort them;

In step 2, the sample tag data are screened as follows:

First, the popular tag samples are selected: the tags are sorted in descending order of frequency, the top 5% of tags in the sorted set are taken as popular tags, and their set is denoted U_pop;

Second, the non-popular tag samples are selected: the bottom 85% of tags in the sorted set are taken as the candidate non-popular tag set, denoted Q_unpop. For each popular tag t_pop ∈ U_pop, the tag t_unpop ∈ Q_unpop whose creation time is closest to that of t_pop is selected as a non-popular tag, forming a time-matched contrast; the set of these tags is denoted U_unpop. Finally, U = {U_pop, U_unpop} is taken as the sample tag data;

In step 3, the tag network is constructed as follows: traverse all information content in the community data; if two tags appear together in the same information record, they are connected, i.e. there is an edge between them. This yields the weighted undirected tag network G_Tag, in which the weight of an edge is the number of times its two tags appear together.
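The network construction can be sketched as a co-occurrence count over posts. This is a minimal illustration that assumes each post is available as a list of tag strings; the names `build_tag_network` and `to_adjacency` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_tag_network(posts):
    """Return edge weights of the undirected tag co-occurrence network.

    posts: iterable of tag lists, one list per post (hypothetical input format).
    The result maps each unordered tag pair, stored as a sorted tuple, to the
    number of posts in which both tags appear.
    """
    weights = Counter()
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            weights[(a, b)] += 1
    return weights


def to_adjacency(weights):
    """Unweighted neighbour sets derived from the weighted edge list."""
    adj = {}
    for a, b in weights:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


posts = [["python", "pandas"],
         ["python", "pandas", "numpy"],
         ["java"],
         ["numpy", "python"]]
weights = build_tag_network(posts)
adj = to_adjacency(weights)
```

Storing each pair as a sorted tuple keeps the network undirected: (a, b) and (b, a) always map to the same edge. A tag that never co-occurs with another tag, such as `java` here, produces no edge and is therefore absent from the adjacency.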

In step 4, the tag network feature data are extracted as shown in FIG. 1: for each tag t_i in the sample tag set U = {U_pop, U_unpop}, the network features on the weighted undirected network G_Tag are extracted over the M days after the tag is first proposed, with M = 30. The specific operation is as follows:

1) Relative degree centrality. The degree D_i of tag t_i is calculated as:

D_i = Σ_{j=1}^{N} a_ij, (10)

where N is the total number of tags in the network and a_ij is an element of the network adjacency matrix: a_ij = 1 if tags t_i and t_j are connected by an edge, and a_ij = 0 otherwise. The relative degree centrality of tag t_i is:

DC_i = D_i / (N − 1), (11)

where D_i is the degree of tag t_i.

2) Neighbor mean degree:

NC_i = (1 / N_neighbor) Σ_{j∈Γ(i)} D_j, (12)

where N_neighbor is the number of neighbor nodes of tag t_i, Γ(i) is the set of those neighbors, and Σ_{j∈Γ(i)} D_j is the sum of the degrees of the neighbor nodes of tag t_i.

3) Relative closeness centrality:

CC_i = 1 / d̄_i, with d̄_i = (1 / (N − 1)) Σ_{j≠i} d_ij, (13)

where d_ij is the distance between tag t_i and tag t_j, and d̄_i is the average geodesic distance from tag t_i to the other tag nodes.

4) Eigenvector centrality:

x_i = η Σ_{j=1}^{N} a_ij w_ij x_j, (14)

where η is a proportionality constant and A = (a_ij w_ij) is the weighted network adjacency matrix, in which w_ij is the weight of the edge between tags t_i and t_j, with w_ij = w_ji. Letting x = [x_1 x_2 … x_N]^T, equation (14) can be written in matrix form as follows:

x = ηAx, (15)

so x is the eigenvector of matrix A corresponding to its largest-modulus eigenvalue η^-1, and is called the eigenvector centrality.

5) Node clustering coefficient:

C_i = 2E_i / (k_i (k_i − 1)), (16)

where E_i is the number of edges that actually exist among the k_i neighbor nodes of tag t_i, and k_i(k_i − 1)/2 is the maximum number of edges that could exist among those k_i neighbor nodes.
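The eigenvector centrality can be approximated by power iteration on the weighted adjacency matrix. The patent does not say how the eigenvector is computed, so the sketch below is one possible approach: it iterates with A + I (which has the same eigenvectors as A) so that the iteration also converges on bipartite graphs, and it normalises x to unit Euclidean length. The function name and the edge-dictionary input are hypothetical.

```python
def eigenvector_centrality(weights, iterations=200, tol=1e-10):
    """Dominant eigenvector of the weighted adjacency matrix A.

    weights: dict mapping an unordered tag pair (a, b) to the edge weight w_ab.
    Returns a dict tag -> centrality, normalised to unit Euclidean length.
    """
    nodes = sorted({t for edge in weights for t in edge})
    x = {t: 1.0 for t in nodes}
    for _ in range(iterations):
        # Multiply by (A + I): start from x itself, then add weighted neighbours.
        new = dict(x)
        for (a, b), w in weights.items():
            new[a] += w * x[b]
            new[b] += w * x[a]
        norm = sum(v * v for v in new.values()) ** 0.5
        if norm == 0:
            return x
        new = {t: v / norm for t, v in new.items()}
        if max(abs(new[t] - x[t]) for t in nodes) < tol:
            return new
        x = new
    return x


# Toy network: a path a - b - c with unit weights.
weights = {("a", "b"): 1.0, ("b", "c"): 1.0}
centrality = eigenvector_centrality(weights)
```

For the three-node path the dominant eigenvector is proportional to (1, √2, 1), so the middle tag receives the highest score.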

In step 4, the attribute features of the sample tag data are extracted: for each tag t_i ∈ U, the features over the 30 days after its first proposal time are extracted as follows:

4.1) collect all questions containing tag t_i within the 30 days after it is proposed;

4.2) for this set of questions, find the set of all questioners, the set of all respondents, and the total number of upvotes over all questions;

4.3) for all questioners and respondents of the questions containing tag t_i, compute the average number of answers and the average number of questions they had posted before the current time;

4.4) compute the average number of upvotes per question for tag t_i, and the number of participants of the tag, i.e. the sum of the number of respondents and the number of questioners;

4.5) compute the average answer response time of the tag's questions within the 30 days, using: the number of questions containing tag t_i within the 30 days; the number of answers to the s-th question of tag t_i within the 30 days; the creation time of the s-th question of tag t_i; and the creation time of the v-th answer to that question. The response time difference between each answer and its question is calculated, and the differences are averaged over all questions and answers.
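The counting features in 4.1) to 4.4) can be sketched as simple aggregations. The record layout below (fields `asker`, `text`, `votes`, `answers`, `user`) is hypothetical, and the sketch assumes the questions have already been restricted to the 30-day window of the tag.

```python
def attribute_features(questions):
    """Aggregate attribute features for one tag.

    questions: list of dicts with hypothetical keys
      "asker"   -> user id of the questioner
      "text"    -> question body
      "votes"   -> number of upvotes
      "answers" -> list of dicts, each with a "user" key
    """
    askers = {q["asker"] for q in questions}
    answerers = {a["user"] for q in questions for a in q["answers"]}
    n_questions = len(questions)
    total_votes = sum(q["votes"] for q in questions)
    total_words = sum(len(q["text"].split()) for q in questions)
    return {
        "n_questions": n_questions,
        "n_answers": sum(len(q["answers"]) for q in questions),
        # Sum of questioner and respondent counts, as described in the text.
        "n_participants": len(askers) + len(answerers),
        "avg_words": total_words / n_questions if n_questions else 0.0,
        "total_votes": total_votes,
        "avg_votes": total_votes / n_questions if n_questions else 0.0,
    }


questions = [
    {"asker": "u1", "text": "how to sort a list", "votes": 3,
     "answers": [{"user": "u2"}, {"user": "u3"}]},
    {"asker": "u2", "text": "what is a tuple", "votes": 1,
     "answers": [{"user": "u3"}]},
]
features = attribute_features(questions)
```

A user who both asks and answers (u2 here) is counted once as a questioner and once as a respondent, following the "sum of respondents and questioners" wording; counting distinct users instead would be a one-line change.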

In step 5, the SVM-based tag popularity trend prediction model is constructed and trained as follows. First, the Gaussian kernel (RBF) is chosen as the kernel function, i.e. the inner product of samples t_i and t_j in the feature space is computed in the original sample space by the function k(t_i, t_j), whose expression is as follows:

k(t_i, t_j) = exp(−‖t_i − t_j‖² / (2δ²)),

where δ is the bandwidth of the Gaussian kernel;

Then the optimal parameter values of the SVM model are found by grid search, followed by ten-fold cross-validation: the data are randomly divided into 10 parts, each part is used in turn as the test sample while the remaining 9 parts serve as training samples, and the SVM-based tag popularity trend prediction model is thereby obtained.
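The grid search with ten-fold cross-validation can be outlined in plain Python. The sketch does not train an SVM: `score_fn` is a hypothetical callback that would train on the training indices and return the accuracy on the test indices. The parameter names `C` and `delta` are assumptions, since the text does not say which parameters are searched, and `toy_score` is a stand-in so that the example runs on its own.

```python
import random
from itertools import product


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def grid_search(score_fn, grid, n_samples, k=10):
    """Return (best_mean_score, best_params) over all parameter combinations."""
    best = None
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = [score_fn(params, train, test)
                  for train, test in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, params)
    return best


def toy_score(params, train, test):
    """Stand-in scorer that peaks at C = 10, delta = 0.5 (not a real SVM)."""
    return -abs(params["C"] - 10) - abs(params["delta"] - 0.5)


grid = {"C": [1, 10, 100], "delta": [0.1, 0.5, 2.0]}
best_score, best_params = grid_search(toy_score, grid, n_samples=25)
folds = list(k_fold_indices(25))
```

In practice `score_fn` would wrap an SVM library; fixing the shuffle seed keeps the folds identical across parameter combinations so that their scores are comparable.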

As described above, by constructing the tag network and then extracting the network features and attribute features of each tag within 30 days after it is first proposed, an SVM-based model for predicting the future development trend of tags is constructed. The model provides reasonable predictions for newly appearing tags on a website, which is significant for future tag recommendation and for the propagation of knowledge and information.

Claims

1. A construction method of a label development trend prediction model based on an SVM is characterized by comprising the following steps:

step 3: constructing a tag network: tags that appear in the same piece of information content are considered related, and an edge is formed between every pair of them; all the information is traversed to obtain a weighted undirected tag network G_Tag, in which the nodes are tags, the edges are relationships between tags, and the weight of an edge is the number of times its two tags appear together;

step 4: extracting feature data: for the tags in the sample tag set U = {U_pop, U_unpop}, the network features and attribute features over the M days after each tag is first created are extracted, and a sample training data set is established;

in step 4, the network features of the sample tags are extracted with M = 30, and the network features comprise:

1) relative degree centrality within 30 days after the new tag is proposed: the degree D_i of tag t_i is calculated with isolated nodes removed, using the formula:

D_i = Σ_{j=1}^{N} a_ij, (1)

where N is the total number of tags in the network and a_ij is an element of the network adjacency matrix, i.e. a_ij = 1 if tags t_i and t_j are connected by an edge, and a_ij = 0 otherwise; the relative degree centrality of tag t_i is:

DC_i = D_i / (N − 1), (2)

where D_i is the degree of tag t_i;

2) neighbor mean degree within 30 days after the new tag is proposed:

NC_i = (1 / N_neighbor) Σ_{j∈Γ(i)} D_j, (3)

where N_neighbor is the number of neighbor nodes of tag t_i, Γ(i) is the set of those neighbors, and Σ_{j∈Γ(i)} D_j is the sum of the degrees of the neighbor nodes of tag t_i;

3) relative closeness centrality within 30 days after the new tag is proposed:

CC_i = 1 / d̄_i, with d̄_i = (1 / (N − 1)) Σ_{j≠i} d_ij, (4)

where d_ij is the distance between tag t_i and tag t_j, and d̄_i is the average geodesic distance from tag t_i to the other tag nodes;

4) eigenvector centrality within 30 days after the new tag is proposed:

x_i = η Σ_{j=1}^{N} a_ij w_ij x_j, (5)

where η is a proportionality constant and A = (a_ij w_ij) is the weighted network adjacency matrix, in which w_ij is the weight of the edge between tags t_i and t_j, with w_ij = w_ji; letting x = [x_1 x_2 … x_N]^T, equation (5) can be written in matrix form as follows:

x = ηAx, (6)

so x is the eigenvector of matrix A corresponding to the eigenvalue η^-1, and is called the eigenvector centrality;

5) node clustering coefficient within 30 days after the new tag is proposed:

C_i = 2E_i / (k_i (k_i − 1)), (7)

where E_i is the number of edges that actually exist among the k_i neighbor tag nodes of tag t_i, and k_i(k_i − 1)/2 is the maximum number of edges that could exist among those k_i neighbor nodes;

step 5: using a support vector machine (SVM) as the machine learning classifier, selecting a kernel function, training an SVM-based tag popularity trend prediction model, and performing ten-fold cross-validation to obtain the model result.

2. The method for constructing the SVM-based tag development trend prediction model according to claim 1, wherein: in step 1, the data after N days are selected as the preprocessed data, where N is chosen so that the first 10% of the website's tags have already been created within N days, i.e. the website's tag network has preliminarily formed.

3. The method for constructing the SVM-based tag development trend prediction model according to claim 1 or 2, wherein: in step 2, the sample tag data are selected by sorting the tags in descending order of frequency; the top α% of tags in the sorted set are taken as popular tags, and their set is denoted U_pop; the bottom β% of tags are taken as the candidate non-popular tag set, denoted Q_unpop; for each popular tag t_pop ∈ U_pop, the tag t_unpop ∈ Q_unpop whose creation time is closest to that of t_pop is selected as a non-popular tag, serving as a contrast to the popular tag, and the set of these tags is denoted U_unpop.

4. The method for constructing the SVM-based tag development trend prediction model according to claim 1 or 2, wherein in step 4 the attribute features of the sample tags are extracted, and the attribute features comprise:

4.1) all answers to the questions containing the new tag within 30 days after the tag is proposed;

4.2) for all questioners and respondents who participate in the tag within 30 days, their average prior number of answers, average prior number of questions, and average prior elapsed time;

4.3) the average answer response time of the tag's questions within 30 days, calculated from: the number of questions containing tag t_i within the 30 days; the number of answers to the s-th question of tag t_i within the 30 days; the creation time of the s-th question of tag t_i; and the creation time of the v-th answer to that question; the response time difference between each answer and its question is calculated, and the differences are averaged over all questions and answers;

4.4) the number of all users participating in the tag within 30 days, i.e. the sum of the questioners and the respondents of its questions;

4.5) the average word count of all questions containing the tag within 30 days;

4.6) the total number of upvotes of all questions containing the tag within 30 days.

5. The method for constructing the SVM-based tag development trend prediction model according to claim 1 or 2, wherein: in step 5, the two-class support vector machine (SVM) model is constructed as follows:

first, the Gaussian kernel (RBF) is chosen as the kernel function, i.e. the inner product of samples t_i and t_j in the feature space is computed in the original sample space by the function k(t_i, t_j), whose expression is as follows:

k(t_i, t_j) = exp(−‖t_i − t_j‖² / (2δ²)),

where δ is the bandwidth of the Gaussian kernel;