CN107292519B

CN107292519B - Browsing service perception index prediction method based on multi-label learning

Info

Publication number: CN107292519B
Application number: CN201710493097.8A
Authority: CN
Inventors: 李克; 徐小龙; 王海
Original assignee: Beijing Union University
Current assignee: Beijing Union University
Priority date: 2017-06-26
Filing date: 2017-06-26
Publication date: 2020-11-03
Anticipated expiration: 2037-06-26
Also published as: CN107292519A

Abstract

The invention discloses a browsing-type service perception index prediction method based on multi-label learning, which aims to predict, promptly and accurately, the KQI indexes of a user's web browsing service according to the scene in which the user is located. Based on massive historical data on user service perception, namely the quality of the service perception indexes in different scenes, the method predicts the quality of service experience of a user in a specific scene and issues early warnings, so that service experience problems can be found as early as possible, corrective measures can be taken in time, and the complaint rate and the off-network rate are effectively reduced.

Description

Browsing service perception index prediction method based on multi-label learning

Technical Field

The invention belongs to the technical field of network services, and particularly relates to a browsing-type service perception index prediction method based on multi-label learning.

Background

When a mobile network user uses an OTT service (e.g., web browsing or video playback), the quality of the service experience can generally be evaluated with a set of key quality indicators (KQIs), such as web page opening delay and download rate. The quality of the experience is affected by various factors, including the quality of the terminal, the quality of the mobile network in which the service is used, the quality of the App, and the bandwidth and load of the SP website server cluster.

As the provider of the transmission channel for these services and a key link in guaranteeing service experience, a telecom operator needs to safeguard the user's service experience as far as possible; otherwise, users may complain or even leave the network.

At present, the network operation and optimization department of a telecom operator generally guarantees network quality through daily network optimization work. However, a large gap remains between network quality and the user's service experience: good network quality does not necessarily guarantee good service experience, because the experience is the combined effect of all the factors above. The customer service department usually discovers a service experience problem only when it receives a user complaint, and then coordinates with the network operation and optimization department to troubleshoot and solve it, which is a passive approach.

If the service experience of the user can be continuously monitored in daily network operation, and the service experience of the user in a specific scene can be predicted and early warned according to massive user service perception historical data (the quality of service perception indexes in different scenes), the service experience problem can be found as soon as possible, relevant measures can be taken in time to improve the service experience problem, and the complaint rate and the off-network rate can be effectively reduced.

Disclosure of Invention

The invention provides a browsing service perception index prediction method based on multi-label learning, and aims to solve the problem of timely and accurately predicting a KQI index of a webpage browsing service of a user according to a scene where the user is located.

In order to achieve the purpose, the invention adopts the following technical scheme:

a browsing class service perception index prediction method based on multi-label learning comprises the following steps:

step S1, constructing a training sample set from the browsing service perception sample set;

step S2, constructing a k nearest neighbor sample set of the training samples;

step S3: calculating the prior probabilities and the normalized frequency matrices

For each label item y_j, j = 1 to q, the prior probabilities P(H_j) and P(¬H_j) are calculated according to formula (1), wherein H_j and ¬H_j respectively denote the events that a newly acquired unlabeled sample x has and does not have the label item y_j, P(H_j) and P(¬H_j) respectively denote the prior probabilities that H_j and ¬H_j hold, and s is a control parameter,

the normalized frequency matrices [f_j[r]]_k×q and [f̃_j[r]]_k×q are calculated according to formulas (2) and (3), wherein δ_j(x_i) denotes the number of samples having the label y_j among the nearest neighbor samples of the training sample x_i, [·] denotes rounding, f_j[r] denotes the number of training samples that have the label y_j and in whose nearest neighbor sample set the proportion of samples also having the label y_j corresponds to r, and f̃_j[r] denotes the number of training samples that do not have the label y_j and in whose nearest neighbor sample set the proportion of samples having the label y_j corresponds to r;

step S4: constructing k neighbor sample set of unknown sample x

For the unknown sample x, its k nearest neighbor sample set N(x) is constructed from the training sample set according to the method of step S2; the actual number of nearest neighbor samples is k_x, where k_x ≤ k;

step S5: calculating the same-label statistics of the unknown sample x

For each label item y_j, j = 1 to q, the number C_j of samples in N(x) having that label item is counted according to formula (4); C_j is called the same-label statistic of the unknown sample x in its k_x nearest neighbor sample set;

step S6: calculating likelihood probability of unknown sample x

The likelihood probabilities P(C_j|H_j) and P(C_j|¬H_j) are calculated according to formulas (5) and (6), wherein P(C_j|H_j) denotes the likelihood that, when the unknown sample x has the label y_j, the observed proportion of its nearest neighbor samples also has the label y_j;

step S7: estimating the label value of an unknown sample x

The estimated value {y₁, y₂} of the label set Y of the unknown sample x is calculated according to formulas (7) and (8).

Considering the strong correlation between the two indexes, first packet delay and page opening delay, and in particular the influence of the first packet delay on the page opening delay, a dedicated calculation is adopted when estimating whether y₂, the label item of the page opening delay, holds (that is, whether its label value is 1).

Preferably, step S1 comprises the following steps:

step S1a, selecting attribute items of training sample set

A subset of all fields of the sample, namely {date, time, longitude, latitude, large area number, cell number, field strength, signal quality, website name, website IP, DNS IP, user identification, terminal model}, is selected as the attribute set x = {x₁, x₂, ..., x_d} of the training samples, where d is the dimension of the attribute set; the attribute fields {date, time, longitude, latitude, field strength, signal quality} are numerical data, and the attribute fields {large area number, cell number, website name, website IP, DNS IP, user identification, terminal model} are nominal data;

step S1b, selecting label items of training sample set

A subset of all fields of the sample, namely {first packet delay, page opening delay}, is selected as the label set Y = {y₁, y₂, ..., y_q} of the training samples, where q is the dimension of the label set; the label fields {first packet delay, page opening delay} are Boolean data;

step S1c, selection of training sample

According to the attribute set and the label set selected in step S1a and step S1b, m samples are randomly selected from the browsing service perception sample set as the training sample set D, i.e. D = {(x_i, Y_i) | 1 ≤ i ≤ m};

Step S1d, conversion of training sample attribute values and label values

If the original values of date and time in the training samples are not numerical data, they are converted: a certain date is taken as the reference and assigned the value 0, and each date is represented by the number of days from the reference date; time is represented in minutes, with midnight (00:00) as the reference point.

All numerical data in the training samples are normalized using the following formula:

x'_i = (x_i - x_i^min) / (x_i^max - x_i^min)

wherein x_i denotes the true value of attribute i, and x_i^min and x_i^max denote the minimum and maximum values of that attribute in the training sample set.

For each label field {first packet delay y₁, page opening delay y₂} in the training samples, the values in the original browsing service perception sample set are numerical data; they are converted into Boolean data according to the preset poor-perception-quality decision thresholds {T₁, T₂} and formula (9), wherein the function [c] returns 1 when the condition c holds and 0 otherwise.

Preferably, in step S2, for each sample vector x_i (i = 1 to m) in the training sample set, at most k nearest neighbor samples are searched for in the training sample set to form the k nearest neighbor sample set N(x_i) of that sample vector; the actual number of nearest neighbor samples in the set is k_i, where k_i ≤ k. The specific method comprises the following steps:

step S2a, for the sample vector x_i = {x_il}, l = 1 to d, all samples in the training sample set whose date attribute differs from x_i1 by less than a set threshold Td (default value 10) are found and constitute the initial nearest neighbor sample set;

step S2b, in the initial nearest neighbor sample set, the samples satisfying at least one of the following conditions are found and constitute the intermediate nearest neighbor sample set: the large area number is the same as x_i5, or the Euclidean distance to x_i calculated from longitude and latitude is less than a set threshold Tdis;

step S2c, the weighted Euclidean distance between each sample vector in the intermediate nearest neighbor sample set and the sample vector x_i is calculated, the samples are arranged in ascending order of distance, and at most the first k samples are taken as the k nearest neighbor sample set.

The invention has the following beneficial effects:

Based on massive historical data on user service perception (the quality of the service perception indexes in different scenes), the service experience of a user in a specific scene is predicted and early warnings are issued, so that service experience problems can be found early, corrective measures can be taken in time, and the complaint rate and the off-network rate are effectively reduced.

Drawings

FIG. 1 is a flow chart of a prediction method of the present invention;

FIG. 2 is a flow chart for constructing a training sample set.

Detailed Description

As shown in fig. 1 and 2, the invention provides a browsing-class service perception index prediction method based on multi-label learning, which comprises the following steps:

step S1: constructing a training sample set

In a known local mobile network (for example, an LTE network in Beijing), when a user uses a web browsing App (such as UCweb or QQ browser) on a smart terminal to browse a page in a predefined target page set (such as a surf homepage, a search homepage and the like), a "web browsing service perception sample" for that moment is obtained, for example by a data acquisition App deployed on the user terminal; all samples collected from a large number of user terminals within a certain time range form the browsing service perception sample set.

The information (i.e. the sample fields) contained in a web browsing service perception sample comprises at least: date, time, network type, cell identification, current longitude and latitude of the terminal, field strength (named differently in different network types: Rxlevel in GSM networks, RSRP in LTE networks, etc.), signal quality (named differently in different network types: C/I, SINR, RSRQ, etc.), user identification (IMSI), terminal identification (IMEI or MEID), terminal model, browser App name, browsed website URL, browsed website IP, DNS IP, first packet delay, page opening delay, DNS resolution delay, TCP connection delay, GET request delay, and reception response delay.

Wherein: the cell identification is a combination of identification parameters that uniquely identifies a cell, and generally consists of a large area number plus a cell number. Different networks use different parameter names: for example, GSM, WCDMA and TD-SCDMA networks use LAC + CI, and LTE uses TAC + ECI.

Wherein: the "first packet delay" is defined as the time elapsed from the user initiating the web browsing request to the receipt of the first HTTP 200 OK packet of the target server's response. First packet delay = DNS resolution delay + TCP connection delay + GET request delay.

Wherein: the "page opening delay" is defined as the time from the user initiating the browsing request to the completion of the download of the entire HTTP page (only the text content of the page, not including the secondary loading of resources). Page opening delay = first packet delay + reception response delay.

Wherein: "DNS resolution delay" refers to the delay from when the terminal initiates a DNS resolution request to when DNS resolution is completed; "TCP connection delay" refers to the delay from the end of DNS resolution to the completion of TCP connection (three-way handshake) establishment; "GET request delay" refers to the delay from the time a GET request is issued to the time the first TCP packet (containing an HTTP 200 OK) is received; "reception response delay" refers to the delay from the reception of the first response packet to the transmission of FIN, ACK by the terminal (i.e., reception is completed).
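The two additive relationships defined above can be illustrated with a minimal Python sketch. The class name `DelaySample`, its field names, and the use of milliseconds are assumptions made for illustration only and do not appear in the source.

```python
from dataclasses import dataclass


@dataclass
class DelaySample:
    """Component delays of one page load (milliseconds assumed)."""
    dns_resolution: float
    tcp_connection: float
    get_request: float
    reception_response: float

    @property
    def first_packet_delay(self) -> float:
        # first packet delay = DNS resolution + TCP connection + GET request
        return self.dns_resolution + self.tcp_connection + self.get_request

    @property
    def page_opening_delay(self) -> float:
        # page opening delay = first packet delay + reception response delay
        return self.first_packet_delay + self.reception_response


# Made-up example values
sample = DelaySample(dns_resolution=40.0, tcp_connection=60.0,
                     get_request=150.0, reception_response=500.0)
```

Because the page opening delay contains the first packet delay as a summand, the two indexes are strongly correlated, which is the property the method later relies on when estimating the second label.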

Step S1 a: attribute item selection for training sample sets

A subset of all fields of the sample, namely {date, time, longitude, latitude, large area number, cell number, field strength, signal quality, website name, website IP, DNS IP, user identification, terminal model}, is selected as the attribute set x = {x₁, x₂, ..., x_d} of the training samples, where d is the dimension of the attribute set, here d = 13; the attribute fields {date, time, longitude, latitude, field strength, signal quality} are numerical data, and the attribute fields {large area number, cell number, website name, website IP, DNS IP, user identification, terminal model} are nominal data;

step S1 b: labeled item selection for training sample set

A subset of all fields of the sample, namely {first packet delay, page opening delay}, is selected as the label set Y = {y₁, y₂, ..., y_q} of the training samples, where q is the dimension of the label set, here q = 2; the label fields {first packet delay, page opening delay} are Boolean data;

step S1 c: selection of training samples

According to the attribute set and the label set selected in steps S1a and S1b, m samples are randomly selected from the browsing service perception sample set as the training sample set D, i.e. D = {(x_i, Y_i) | 1 ≤ i ≤ m};

Step S1 d: conversion of training sample attribute values and label values

If the original values of date and time in the training samples are not numerical data, they are converted: a certain date (such as January 1, 2015) is taken as the reference and assigned the value 0, and each date is represented by the number of days from the reference date. Time is represented in minutes, with midnight (00:00) as the reference point.
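A minimal Python sketch of this conversion follows. The function names are hypothetical; the reference date is the example given in the text.

```python
from datetime import date, datetime

# Example reference date from the text; any fixed date would serve.
REFERENCE_DATE = date(2015, 1, 1)


def date_to_days(d: date, reference: date = REFERENCE_DATE) -> int:
    """Number of days from the reference date (the reference itself maps to 0)."""
    return (d - reference).days


def time_to_minutes(t: datetime) -> int:
    """Minutes elapsed since midnight, at one-minute granularity."""
    return t.hour * 60 + t.minute


timestamp = datetime(2017, 6, 26, 14, 30)
days = date_to_days(timestamp.date())
minutes = time_to_minutes(timestamp)
```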

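All numerical data in the training samples are normalized by equation (1), i.e.:

x'_i = (x_i - x_i^min) / (x_i^max - x_i^min)

wherein x_i denotes the true value of attribute i, and x_i^min and x_i^max denote the minimum and maximum values of that attribute in the training sample set.

For each label field {first packet delay y₁, page opening delay y₂} in the training samples, the values in the original browsing service perception sample set are numerical data; they are converted into Boolean data according to the preset poor-perception-quality decision thresholds {T₁, T₂} and the formula (1), where the function [c] returns 1 when the condition c holds and 0 otherwise.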
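The normalization and the label conversion can be sketched in Python as follows. Two points are assumptions rather than statements of the source: a constant attribute column is mapped to 0, and a label is set to 1 when the measured delay is strictly greater than its threshold (the text only says that the thresholds decide poor perception quality). The threshold value used in the test is likewise made up.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale one numerical attribute column into [0, 1] by its min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def to_boolean_label(delay: float, threshold: float) -> int:
    """Return [delay > threshold]: 1 if the condition holds, else 0.

    Assumption: 1 marks poor perceived quality (delay above the threshold).
    """
    return int(delay > threshold)


# Made-up field-strength values (dBm)
rsrp = [-110.0, -90.0, -70.0]
normalized = min_max_normalize(rsrp)
```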

Step S2: constructing k nearest neighbor sample set of training samples

For each sample vector x_i (i = 1 to m) in the training sample set, at most k nearest neighbor samples are searched for in the training sample set to form the k nearest neighbor sample set N(x_i) of that sample vector; the actual number of nearest neighbor samples in the set is k_i (k_i ≤ k). The specific method comprises the following steps:

Step S2a: for the sample vector x_i = {x_il}, l = 1 to d, all samples in the training sample set (other than the sample itself) whose date attribute differs from x_i1 by less than a set threshold Td (default value 10) are found and constitute the initial nearest neighbor sample set.

Step S2b: in the initial nearest neighbor sample set, the samples satisfying at least one of the following conditions are found and constitute the intermediate nearest neighbor sample set: the large area number is the same as x_i5 (i.e. the large area number of x_i), or the Euclidean distance to x_i calculated from longitude and latitude is less than a set threshold Tdis (default value 2000 m).

Step S2c: the weighted Euclidean distance between each sample vector in the intermediate nearest neighbor sample set and the sample vector x_i is calculated, the samples are arranged in ascending order of distance, and at most the first k samples are taken as the k nearest neighbor sample set.
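The three-stage neighbor search of step S2 can be sketched in Python as below. Several details are assumptions made for illustration: the `Sample` structure and its field names, the planar approximation used to turn longitude and latitude into a distance in meters, and the restriction of the weighted Euclidean distance to a vector of already normalized numerical features with caller-supplied weights (the text does not state the weights or how nominal attributes enter the distance).

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    day: int                    # days from the reference date
    lon: float                  # degrees
    lat: float                  # degrees
    area: str                   # large area number
    features: Sequence[float]   # normalized numerical attributes


def geo_distance_m(a: Sample, b: Sample) -> float:
    """Planar approximation of the lon/lat distance in meters (assumption)."""
    earth_radius = 6_371_000.0
    mean_lat = math.radians((a.lat + b.lat) / 2)
    dx = math.radians(b.lon - a.lon) * math.cos(mean_lat) * earth_radius
    dy = math.radians(b.lat - a.lat) * earth_radius
    return math.hypot(dx, dy)


def weighted_euclidean(u: Sequence[float], v: Sequence[float],
                       w: Sequence[float]) -> float:
    return math.sqrt(sum(wi * (ui - vi) ** 2 for ui, vi, wi in zip(u, v, w)))


def k_nearest(target: Sample, train: List[Sample], k: int,
              weights: Sequence[float], td: int = 10,
              tdis: float = 2000.0) -> List[Sample]:
    # Stage a: keep samples close in date, excluding the sample itself.
    initial = [s for s in train
               if s is not target and abs(s.day - target.day) < td]
    # Stage b: same large area number OR geographically close.
    intermediate = [s for s in initial
                    if s.area == target.area
                    or geo_distance_m(s, target) < tdis]
    # Stage c: ascending weighted Euclidean distance, at most k samples.
    intermediate.sort(
        key=lambda s: weighted_euclidean(s.features, target.features, weights))
    return intermediate[:k]
```

Because stages a and b filter before ranking, the returned list may hold fewer than k samples, which is why the method tracks the actual neighbor count k_i separately from k.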

Step S3: calculating prior probability and normalization frequency matrix

For each label item y_j, j = 1 to q, the prior probabilities P(H_j) and P(¬H_j) are calculated by equation (2), wherein H_j and ¬H_j respectively denote the events that a newly acquired unlabeled sample x (called an "unknown sample", i.e. a sample with attribute information only and no label information) has and does not have the label item y_j (i.e., the label item y_j is 1 and 0 respectively), P(H_j) and P(¬H_j) respectively denote the prior probabilities that H_j and ¬H_j hold, and s is the control parameter (typically taken to be 1).

Then, the normalized frequency matrices [f_j[r]]_k×q and [f̃_j[r]]_k×q are calculated according to equations (3) and (4), wherein δ_j(x_i) denotes the number of samples having the label y_j among the nearest neighbor samples of the training sample x_i, and [·] denotes rounding. f_j[r] then denotes the number of training samples that have the label y_j (i.e., the label value is 1) and in whose nearest neighbor sample set the proportion of samples also having the label y_j corresponds to r, while f̃_j[r] denotes the number of training samples that do not have the label y_j (i.e., the label value is 0) and in whose nearest neighbor sample set the proportion of samples having the label y_j corresponds to r.
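The quantities of step S3 can be sketched in Python as follows. Since the equations themselves are not reproduced in this text, the sketch assumes the standard ML-kNN forms, which are consistent with the definitions above: a smoothed prior (s + count) / (2s + m), and a frequency table indexed by r = [δ_j(x_i) / k_i × k] so that samples with fewer than k neighbors are rescaled to the same range. The use of k + 1 bins (r = 0 to k), round-half-up rounding, and skipping samples with no neighbors are further assumptions.

```python
from typing import List, Tuple


def priors(labels: List[List[int]], s: float = 1.0) -> List[Tuple[float, float]]:
    """Return (P(H_j), P(not H_j)) per label, with smoothing parameter s."""
    m, q = len(labels), len(labels[0])
    result = []
    for j in range(q):
        p = (s + sum(row[j] for row in labels)) / (2 * s + m)
        result.append((p, 1.0 - p))
    return result


def normalized_frequencies(labels: List[List[int]],
                           neighbor_idx: List[List[int]],
                           k: int) -> Tuple[List[List[int]], List[List[int]]]:
    """neighbor_idx[i] holds the indices of the (at most k) neighbors of sample i.

    Returns (f, f_not): f[j][r] counts samples WITH label j whose rescaled
    neighbor count is r; f_not[j][r] counts samples WITHOUT label j.
    """
    q = len(labels[0])
    f = [[0] * (k + 1) for _ in range(q)]
    f_not = [[0] * (k + 1) for _ in range(q)]
    for i, neighbors in enumerate(neighbor_idx):
        k_i = len(neighbors)
        if k_i == 0:
            continue  # assumption: samples without neighbors are skipped
        for j in range(q):
            delta = sum(labels[n][j] for n in neighbors)
            r = int(delta / k_i * k + 0.5)  # rescale to 0..k, round half up
            if labels[i][j]:
                f[j][r] += 1
            else:
                f_not[j][r] += 1
    return f, f_not


# Tiny made-up data set: 4 samples, 2 labels, k = 2
labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
neighbor_idx = [[1, 2], [0, 2], [0], [1, 2]]
prior = priors(labels)
f, f_not = normalized_frequencies(labels, neighbor_idx, k=2)
```

In this example the third sample has only one neighbor, so its count of 1 is rescaled to r = 2, the same bin as a sample whose two neighbors both carry the label.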

Step S4: constructing k neighbor sample set of unknown sample x

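For the unknown sample x, its k nearest neighbor sample set N(x) is constructed from the training sample set according to the method of step S2; the actual number of nearest neighbor samples is k_x (k_x ≤ k).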

Step S5: calculating the same-label statistics of the unknown sample x

For each label item y_j, j = 1 to q, the number C_j of samples in N(x) having that label item (i.e., with value 1) is counted according to formula (5); C_j is called the same-label statistic of the unknown sample x in its k_x nearest neighbor sample set.
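A minimal Python sketch of the same-label statistic is given below. The plain count follows the description above; the helper `normalized_index`, which rescales the count from k_x neighbors to the range 0 to k so that it can index a frequency table built for k neighbors, is an assumption by analogy with the normalization in step S3 and is not confirmed by the text.

```python
from typing import List


def same_label_counts(neighbor_labels: List[List[int]]) -> List[int]:
    """C_j: number of neighbors of x that carry label j, for every label j."""
    if not neighbor_labels:
        return []
    q = len(neighbor_labels[0])
    return [sum(row[j] for row in neighbor_labels) for j in range(q)]


def normalized_index(count: int, k_x: int, k: int) -> int:
    """Rescale a count over k_x neighbors to 0..k (assumed, round half up)."""
    return int(count / k_x * k + 0.5)


# Labels of the k_x = 3 neighbors actually found for an unknown sample
neighbor_labels = [[1, 0], [1, 1], [0, 0]]
counts = same_label_counts(neighbor_labels)
indices = [normalized_index(c, k_x=len(neighbor_labels), k=5) for c in counts]
```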

step S6: calculating likelihood probability of unknown sample x

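The likelihood probabilities P(C_j|H_j) and P(C_j|¬H_j) are calculated according to equations (6) and (7), wherein P(C_j|H_j) denotes the likelihood that, when the unknown sample x has the label y_j, the observed proportion of its nearest neighbor samples also has the label y_j.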
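The likelihood step can be sketched in Python as follows, again assuming the standard ML-kNN form because the equations are not reproduced in this text: each likelihood is the smoothed relative frequency of bin r in the corresponding frequency row, (s + f[r]) / (s(k + 1) + Σ f). The function name and the k + 1 bin layout are assumptions.

```python
from typing import List, Tuple


def likelihoods(f_j: List[int], f_not_j: List[int], r: int,
                s: float = 1.0) -> Tuple[float, float]:
    """Return (P(C_j | H_j), P(C_j | not H_j)) for neighbor-count bin r.

    f_j and f_not_j are the frequency rows for label j with bins 0..k.
    """
    bins = len(f_j)  # k + 1 bins, assumed
    p_given_h = (s + f_j[r]) / (s * bins + sum(f_j))
    p_given_not_h = (s + f_not_j[r]) / (s * bins + sum(f_not_j))
    return p_given_h, p_given_not_h


# Made-up frequency rows for one label with k = 2
p_h, p_not_h = likelihoods([0, 0, 3], [0, 0, 1], r=2)
```

The smoothing term s keeps both likelihoods strictly positive even for bins never seen in training, so the later comparison of the two products is always well defined.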

Step S7: estimating the label value of an unknown sample x

Based on the calculation results of the previous steps, the estimated value {y₁, y₂} of the label set Y of the unknown sample x can be calculated by formulas (8) and (9).
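A generic maximum a posteriori decision of the kind used in ML-kNN can be sketched in Python as below. This is an assumption about the general form only: a label is set to 1 when P(H_j)·P(C_j|H_j) exceeds P(¬H_j)·P(C_j|¬H_j). The dedicated treatment of the page opening delay label y₂ is not covered here, because its formula is not reproduced in this text.

```python
def estimate_label(prior_h: float, prior_not_h: float,
                   lik_h: float, lik_not_h: float) -> int:
    """Return 1 if the posterior for 'has the label' is the larger one."""
    # Ties go to 0 (assumption).
    return int(prior_h * lik_h > prior_not_h * lik_not_h)


# Made-up probabilities for two labels
y1 = estimate_label(prior_h=0.6, prior_not_h=0.4, lik_h=0.5, lik_not_h=0.5)
y2 = estimate_label(prior_h=0.3, prior_not_h=0.7, lik_h=0.6, lik_not_h=0.4)
```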

Claims

1. A browsing-class service perception index prediction method based on multi-label learning, characterized by comprising the following steps:

step S1, constructing a training sample set from the browsing service perception sample set;

step S2, constructing a k nearest neighbor sample set of the training samples;

step S3: calculating the prior probabilities and the normalized frequency matrices

For each label item y_j, j = 1 to q, the prior probabilities P(H_j) and P(¬H_j) are calculated according to formula (1), wherein H_j and ¬H_j respectively denote the events that the unknown sample has and does not have the label item y_j, P(H_j) and P(¬H_j) respectively denote the prior probabilities that H_j and ¬H_j hold, s is the control parameter, q is the dimension of the label set, and m is the number of samples,

the normalized frequency matrices [f_j[r]]_k×q and [f̃_j[r]]_k×q are calculated according to formulas (2) and (3), wherein f_j[r] denotes the number of training samples that have the label y_j and in whose nearest neighbor sample set the proportion of samples also having the label y_j corresponds to r, and f̃_j[r] denotes the number of training samples that do not have the label y_j and in whose nearest neighbor sample set the proportion of samples having the label y_j corresponds to r; k_i is the actual number of nearest neighbor samples of a training sample, and r indexes the entries of the frequency matrices;

step S4: constructing k neighbor sample set of unknown sample x

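For the unknown sample x, its k nearest neighbor sample set is constructed from the training sample set according to the method of step S2; the actual number of nearest neighbor samples is k_x, where k_x ≤ k;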

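step S5: calculating the same-label statistics of the unknown sample x

For each label item y_j, j = 1 to q, the number C_j of samples having that label item in the k_x nearest neighbor sample set of the unknown sample x is counted according to formula (4);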

step S6: calculating likelihood probability of unknown sample x

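The likelihood probabilities P(C_j|H_j) and P(C_j|¬H_j) are calculated according to formulas (5) and (6), wherein P(C_j|H_j) denotes the likelihood that, when the unknown sample x has the label y_j, the observed proportion of its nearest neighbor samples also has the label y_j, and s is a control parameter;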

step S7: estimating the label value of an unknown sample x

The estimated value {y₁, y₂} of the label set Y of the unknown sample x is calculated according to formulas (7) and (8), wherein H₁ and H₂ respectively denote the events that the unknown sample has the label items y₁ and y₂.

2. The browsing-class service perception index prediction method based on multi-label learning according to claim 1, wherein step S1 comprises the following steps:

step S1a, selecting attribute items of training sample set

step S1b, selecting label items of training sample set

A subset of all fields of the sample, namely {first packet delay, page opening delay}, is selected as the label set Y = {y₁, y₂, ..., y_q} of the training samples, where q is the dimension of the label set; the label fields {first packet delay, page opening delay} are Boolean data;

step S1c, selection of training sample

According to the attribute set and the label set selected in steps S1a and S1b, m samples are randomly selected from the browsing service perception sample set as the training sample set D, i.e. D = {(x_i, Y_i) | 1 ≤ i ≤ m};

Step S1d, conversion of training sample attribute values and label values

All numerical data in the training samples are normalized using the following formula:

x'_i = (x_i - x_i^min) / (x_i^max - x_i^min)

wherein x_i denotes the true value of attribute i, and x_i^min and x_i^max denote the minimum and maximum values of that attribute in the training sample set,

for each label field {first packet delay y₁, page opening delay y₂} in the training samples, the values in the original browsing service perception sample set are numerical data; they are converted into Boolean data according to the preset poor-perception-quality decision thresholds {T₁, T₂} and formula (9), wherein the function [c] returns 1 when the condition c holds and 0 otherwise.

3. The method as claimed in claim 1, wherein in step S2, for each sample vector x_i (i = 1 to m) in the training sample set, at most k nearest neighbor samples are searched for in the training sample set to form the k nearest neighbor sample set of that sample vector, the specific method comprising the following steps:

step S2a, for the sample vector x_i = {x_il}, l = 1 to d, all samples in the training sample set whose date attribute differs from x_i1 by less than a set threshold Td (default value 10) are found and constitute the initial nearest neighbor sample set;

step S2b, in the initial nearest neighbor sample set, the samples satisfying at least one of the following conditions are found and constitute the intermediate nearest neighbor sample set: the large area number is the same as x_i5, or the Euclidean distance to x_i calculated from longitude and latitude is less than a set threshold Tdis;

step S2c, the weighted Euclidean distance between each sample vector in the intermediate nearest neighbor sample set and the sample vector x_i is calculated, the samples are arranged in ascending order of distance, and at most the first k samples are taken as the k nearest neighbor sample set.