CN108549907A - A data verification method based on multi-source transfer learning - Google Patents

A data verification method based on multi-source transfer learning

Info

Publication number
CN108549907A
CN108549907A (application CN201810320808.6A)
Authority
CN
China
Prior art keywords
website
province
target
normalization
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810320808.6A
Other languages
Chinese (zh)
Other versions
CN108549907B (en
Inventor
李石君
刘洋
杨济海
邓永康
余伟
余放
李宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201810320808.6A priority Critical patent/CN108549907B/en
Publication of CN108549907A publication Critical patent/CN108549907A/en
Application granted granted Critical
Publication of CN108549907B publication Critical patent/CN108549907B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/245Classification techniques relating to the decision surface

Abstract

The present invention proposes a data verification method based on multi-source transfer learning. The method extracts the site service quantities corresponding to the source data set and the target training set and normalizes them; builds a weight-based SVR model from the transfer-learning SVR model and a radial basis function; initializes and normalizes the site weights of the source data and the target province, and obtains a merged training set by merging the normalized source data set, the normalized target training set, the normalized service-quantity training set, and the normalized service quantities; establishes a prediction model from the merged training set and normalized vectors and computes the model error parameters; iterates repeatedly and computes the final prediction model; obtains the predicted site service quantities of the target province with the final prediction model and denormalizes them. Compared with the prior art, the present invention improves data quality and saves data resources.

Description

A data verification method based on multi-source transfer learning
Technical field
The invention belongs to the field of transfer learning, and more particularly relates to a data verification method based on multi-source transfer learning.
Background technology
As the second physical network of the State Grid power company, the telecommunication management system (TMS) carries the core business of power grid operation and management and is an important guarantee of the safe, stable, and economical operation of the grid. As the core management and control system of the power company's communication specialty, TMS has played a major role in resource management, real-time monitoring, and operation management, and has accumulated massive amounts of data in the process. TMS data are preserved in databases, with each unit independently deploying its own database server. The data mainly comprise: TMS resource data, alarm data, work-order data, and data generated by internal modules; data exchanged between the State Grid communication company and its branches, provincial companies, and municipalities, such as resource reporting between superior and subordinate systems, trouble-ticket dispatch, statistical reports, task dispatch, and alarm reports; and data flows with peer external systems, such as ledger data and workflows. However, data quality problems in TMS severely affect data analysis and decision-making in actual production. They manifest mainly in three aspects: static resource data that do not match reality, incorrectly associated dynamic resource data, and basic data that are not kept up to date. These problems undermine TMS's role in providing strong support for lean management of power communication. At the same time, data volumes differ greatly across provinces in TMS: provincial companies with smaller networks hold 1G~2G of data, large units such as the State Grid communication company reach 30G~40G, while for some particular services remote regions hold only a few hundred KB. Such small data sets are far from sufficient to train a good conventional machine learning model.
Data quality problems such as missing, erroneous, and outdated data have always been an important topic in big data analysis, bringing huge losses to society every year. According to a survey by a German data analysis institution, the United States loses up to 600 billion dollars a year because of bad data, and medical malpractice caused by data errors kills 98,000 patients in the United States every year. In TMS, power business is managed at low frequency: service management data consist mostly of monthly statement data, with no daily (or higher-frequency) management of service progress and state. Furthermore, the entry and maintenance of business process data lag behind the business process itself, so a large amount of data that does not match reality is produced. This seriously affects the company's judgments and decisions about the business in actual production, so data quality itself must be examined before any data analysis. The present invention predicts site service quantities to judge whether service quantities are missing in the station system, and thereby finds abnormal sites. These data differ greatly from province to province. For provinces with sufficient data, traditional machine learning methods such as support vector regression and neural networks can achieve good results, but traditional machine learning requires training data and test data to follow the same distribution, so the data of all provinces cannot simply be pooled for training. Training therefore goes wrong in regions with little data: forcing a model onto one region's data yields a bad model because the data are insufficient, while pooling all provinces' data degrades the model because the data sets are differently distributed. On this basis, the present invention proposes to use the data of other provinces to train on the target data through transfer learning, achieving the goal of abnormal site detection.
Transfer learning is a new field of machine learning whose purpose is to use existing knowledge to learn in different but related domains. It relaxes two basic assumptions of conventional machine learning: that training and test data are independent and identically distributed, and that enough data are available to train a good model. Studies show that the more similar two domains are, the easier transfer learning is and the better it works; otherwise the effect is often poor, and "negative transfer" may even occur. Domain adaptation is an active research direction in transfer learning. Pan et al. proposed the TCA (Transfer Component Analysis) algorithm for domain adaptation. TCA is a feature-based transfer learning method: when the source and target domains follow different data distributions, it maps the data of both domains into a high-dimensional reproducing kernel Hilbert space in which the distance between source and target data is minimized while their respective internal properties are retained to the greatest extent. However, TCA considers only the correlation of target and source data in another space, which is too limited, and its time complexity is relatively high. Dai et al. proposed the instance-based TrAdaBoost (Transfer AdaBoost) algorithm, whose idea is to make maximal use of the source data: find the data in the source that are relevant to the target data, then train on them together with the target data. But TrAdaBoost uses only a single source data set; the result depends on the correlation between source and target data, and the algorithm's correctness is proportional to that correlation, so if the correlation is very weak, negative transfer easily arises. By considering the correlations between multiple sources and the target, Yao et al. proposed two multi-source transfer learning algorithms, MTrA (MultiSource-TrAdaBoost) and TTrA (Task-TrAdaBoost). MTrA assumes multiple source data sets and, in each iteration, trains a weak classifier using the source most correlated with the target data, then combines the weak classifiers into a strong classifier. TTrA trains one weak classifier per source in each iteration, selects the classifier with minimum error on the target data, and after all iterations combines the selected classifiers into a strong classifier. Both multi-source algorithms select, in each iteration, the source most correlated with the target. Although this guarantees that the migrated source data are most relevant to the target, they make no use of the information in the other sources; in actual production each data source is very expensive, and such an approach wastes a large amount of company resources. Data quality problems in TMS have seriously affected the company's judgment and operation of its actual business, and the differences in distribution and volume across regional data further challenge the discovery of data quality problems.
Invention content
To solve the above problems, the present invention proposes a data verification method based on multi-source transfer learning. The technical solution adopted by the present invention is:
Step 1: Obtain from the system data tables the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, and the system the site belongs to, together with the site centrality computed by the PageRank algorithm, to construct the site attributes; further construct a source data set from the site attributes of each site in each province and normalize it; construct a target training set from the site attributes of the province to be predicted and normalize it; and extract the site service quantities corresponding to the source data set and the target training set and normalize them;
Step 2: Build a weight-based SVR model from the transfer-learning SVR model and a radial basis function;
Step 3: Initialize the weights of each site in the source data and the target province, normalize the initialized weights, and initialize the site weights of the source data and the target province in the weighted multi-source TrAdaBoost algorithm; obtain a merged training set by merging the normalized source data set, the normalized target training set, the normalized service-quantity training set, and the normalized service quantities;
Step 4: Establish a prediction model from the merged training set and the normalized vectors through step 2, and compute the model error parameters;
Step 5: Repeat step 4 up to the maximum number of iterations and compute the final prediction model;
Step 6: Predict on the site attributes of the target province with the final prediction model to obtain the predicted site service quantities of the target province, and denormalize the predicted site service quantities.
Preferably, the site attribute, i.e. the feature vector, of a site in step 1 is:

$$x_m^{S_k} = \left(x_{m,1}^{S_k}, x_{m,2}^{S_k}, x_{m,3}^{S_k}, x_{m,4}^{S_k}, x_{m,5}^{S_k}, x_{m,6}^{S_k}, x_{m,7}^{S_k}\right)$$

where $x_m^{S_k}$ is the site attribute of site m in province $S_k$, N is the number of provinces, $n_{S_k}$ is the number of sites in province $S_k$, and the components of $x_m^{S_k}$ are: $x_{m,1}^{S_k}$ the site type, $x_{m,2}^{S_k}$ the site voltage class, $x_{m,3}^{S_k}$ the site dispatch grade, $x_{m,4}^{S_k}$ the site construction year, $x_{m,5}^{S_k}$ the quantity of optical transmission equipment in the site, $x_{m,6}^{S_k}$ the system the site belongs to, and $x_{m,7}^{S_k}$ the site centrality of site m in province $S_k$;
The site type, voltage class, dispatch grade, construction year, optical transmission equipment quantity, and owning system can be obtained from the system data tables. The site centrality of site m in province $S_k$ is first initialized according to the site's degree and the number of sites:

$$ctr_m^{S_k}(0) = \frac{d_m^{S_k}}{n_{S_k}}$$

where $ctr_m^{S_k}$ is the centrality of site m in province $S_k$, $n_{S_k}$ is the number of sites in province $S_k$, and $d_m^{S_k}$ is the degree of site m. The centrality is then updated by PageRank iterations with the following formula until it stabilizes:

$$ctr_m^{S_k}(iter+1) = \frac{1-\alpha}{n_{S_k}} + \alpha \sum_{j \in In(m)} \frac{ctr_j^{S_k}(iter)}{d_j^{S_k}}$$

where iter is the PageRank iteration number, NI = 500 is the total number of PageRank iterations, In(m) is the set of sites in province $S_k$ connected to site m by optical cable, $ctr_j^{S_k}$ is the centrality of the j-th connected site, $d_j^{S_k}$ is that site's number of outgoing cable connections, and α is the damping coefficient;
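The centrality computation above can be sketched in Python. This is a minimal illustration assuming the standard PageRank update with a damping factor over the optical-cable adjacency; the function name and graph encoding are illustrative, not from the patent.

```python
def site_centrality(adj, alpha=0.85, iters=500):
    """PageRank-style centrality over the optical-cable graph.

    adj: dict mapping each site to the list of sites it connects to.
    Returns a dict of centrality scores, initialised uniformly at 1/n.
    """
    n = len(adj)
    c = {m: 1.0 / n for m in adj}
    for _ in range(iters):
        new = {}
        for m in adj:
            # sum contributions from sites linking to m, each divided by
            # that site's out-degree (its number of cable connections)
            s = sum(c[j] / len(adj[j]) for j in adj if m in adj[j])
            new[m] = (1 - alpha) / n + alpha * s
        c = new
    return c
```

With no dangling sites, the scores stay normalised to 1, and a hub connected to many sites ends up with the largest centrality.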
The source data set is constructed from the site attributes of each province with larger data volume:

$$D = \{D_{S_1}, D_{S_2}, \ldots, D_{S_N}\}$$

where N is the number of provinces with larger data volume and $D_{S_k}$ is the k-th source data set, i.e. province $S_k$, containing $n_{S_k}$ samples, i.e. $n_{S_k}$ sites:

$$D_{S_k} = \{x_1^{S_k}, x_2^{S_k}, \ldots, x_{n_{S_k}}^{S_k}\}$$

where $n_{S_k}$ is the number of sites, i.e. samples, in province $S_k$, $S_N$ is the number of provinces, and $x_m^{S_k}$ is the site attribute of site m in province $S_k$, whose components are the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, the system the site belongs to, and the site centrality;
The target training set is constructed from the site attributes of the predicted province $S_T$:

$$D_T = \{x_1^{S_T}, x_2^{S_T}, \ldots, x_{n_T}^{S_T}\}$$

where $n_T$, the number of samples in the target training set, is the number of sites in the predicted province $S_T$, and $x_i^{S_T}$ ($i \in [1, n_T]$) is the site attribute, i.e. feature vector, of site i, whose components are the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, the system the site belongs to, and the site centrality of site i in the predicted province $S_T$;
The source data set D and the target training set $D_T$ are respectively discretized and normalized to obtain the normalized source data set $\hat D$ and the normalized target training set $\hat D_T$.

The site service quantities corresponding to each province $S_k$ in the source data set D are counted to obtain the service-quantity data set:

$$Y = \{Y_{S_1}, Y_{S_2}, \ldots, Y_{S_N}\},\qquad Y_{S_k} = \{y_1^{S_k}, \ldots, y_{n_{S_k}}^{S_k}\}$$

where $S_k \in [1, S_N]$ and $n_{S_k}$ is the number of sites in province $S_k$.

The site service quantities corresponding to the target training set $D_T$, i.e. province $S_T$, are counted to obtain the target service-quantity training set:

$$Y_T = \{y_1^{S_T}, \ldots, y_{n_T}^{S_T}\}$$

where $n_T$ is the number of sites in province $S_T$.

The service-quantity data set Y and the target service-quantity training set $Y_T$ are normalized with min-max standardization:

$$\hat y = \frac{y - \min(Y)}{\max(Y) - \min(Y)}$$

where min takes the set minimum, max takes the set maximum, and y is the service quantity of any province's site in Y or $Y_T$. After min-max standardization, the normalized service-quantity data set $\hat Y$ and the normalized target service-quantity training set $\hat Y_T$ are obtained.
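The min-max normalization and its inverse (the denormalization performed in step 6) can be written compactly. A minimal sketch with illustrative function names:

```python
def minmax_normalize(values):
    """Min-max normalisation used for both attributes and service counts.

    Returns the scaled values plus (min, max) so the transform can be
    inverted later. Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def minmax_denormalize(norm_values, lo, hi):
    """Inverse transform (the 'renormalization' of step 6)."""
    return [v * (hi - lo) + lo for v in norm_values]
```

Keeping (min, max) alongside the normalized set is what makes the final predicted service quantities recoverable on the original scale.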
Preferably, for the weight-based SVR model of step 2, from step 1 the normalized source data set is:

$$\hat D = \{\hat D_{S_1}, \ldots, \hat D_{S_N}\}$$

The k-th normalized source data set, i.e. province $S_k$, contains $n_{S_k}$ samples, i.e. $n_{S_k}$ sites:

$$\hat D_{S_k} = \{\hat x_1^{S_k}, \ldots, \hat x_{n_{S_k}}^{S_k}\}$$

The training data set is constructed from the normalized source data set as:

$$T_{S_k} = \{(\hat x_1^{S_k}, \hat y_1^{S_k}), \ldots, (\hat x_{n_{S_k}}^{S_k}, \hat y_{n_{S_k}}^{S_k})\}$$

where $S_N$ is the number of provinces, i.e. the number of source data sets, $n_{S_k}$ is the number of sites in province $S_k$, i.e. the size of the training data set $T_{S_k}$, $\hat y_i^{S_k}$ is the normalized service quantity of site i of province $S_k$ in $T_{S_k}$, and $\hat x_i^{S_k}$ is the normalized site attribute, i.e. normalized feature vector, of site i, whose components are the normalized site type, voltage class, dispatch grade, construction year, optical transmission equipment quantity, owning system, and centrality.

Each sample in the k-th normalized source data set $\hat D_{S_k}$, i.e. each site's normalized attribute, carries a weight $w_i^{S_k}$. The weight-based w-SVR model is:

$$f(\hat x) = q \cdot \varphi(\hat x) + b$$

where q is the weight parameter of the model and b is the bias parameter;
The parameter solution process of the weight-based w-SVR model is as follows.

The linear ε-insensitive loss function is defined as:

$$L_\varepsilon\!\left(\hat y_i^{S_k}, f(\hat x_i^{S_k})\right) = \max\!\left(0,\ \left|\hat y_i^{S_k} - f(\hat x_i^{S_k})\right| - \varepsilon\right)$$

where ε is the insensitive loss value: when the difference between the normalized service quantity $\hat y_i^{S_k}$ of site i of province $S_k$ and the predicted value $f(\hat x_i^{S_k})$ of the regression estimate function is less than ε, the loss equals 0.

The present invention selects the radial basis kernel function to nonlinearly transform the training data set into another feature space, constructs the regression estimate function in the feature space after the radial basis kernel transformation, and initializes the weights in the k-th normalized source data set $\hat D_{S_k}$ as $w_i^{S_k} = 1/n_{S_k}$. The radial basis kernel function is:

$$K(\hat x_i, \hat x_j) = \exp\!\left(-\frac{\left\|\hat x_i - \hat x_j\right\|^2}{2\sigma^2}\right)$$

where $\sigma^2$ is the variance of the training data set $T_{S_k}$;
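The radial basis kernel above is the standard Gaussian kernel; a one-line sketch (function name illustrative):

```python
import math

def rbf_kernel(x, y, sigma2):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).

    sigma2 is the training-set variance, as in the text.
    """
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma2))
```

The kernel equals 1 for identical inputs and decays toward 0 as the normalized attribute vectors move apart.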
Weight coefficients are introduced into the SVR model to control the influence of singular variance; the optimization objective is:

$$\min_{q,\,b,\,\xi,\,\xi'} \ \frac{1}{2}\|q\|^2 + C\sum_{i=1}^{n_{S_k}} w_i^{S_k}\left(\xi_i + \xi_i'\right)$$

subject to

$$\hat y_i^{S_k} - q \cdot \varphi(\hat x_i^{S_k}) - b \le \varepsilon + \xi_i,\qquad q \cdot \varphi(\hat x_i^{S_k}) + b - \hat y_i^{S_k} \le \varepsilon + \xi_i',\qquad \xi_i, \xi_i' \ge 0$$

where $\xi_i$ is the first slack variable, $\xi_i'$ is the second slack variable, ε is the insensitive loss value, C is the model parameter, q is the weight parameter of the model, and b is the bias parameter. By the Lagrange transformation and duality, the optimization problem is converted to:

$$\max_{\alpha,\,\alpha'} \ -\frac{1}{2}\sum_{i,j}\left(\alpha_i-\alpha_i'\right)\left(\alpha_j-\alpha_j'\right)K(\hat x_i,\hat x_j) - \varepsilon\sum_i\left(\alpha_i+\alpha_i'\right) + \sum_i \hat y_i^{S_k}\left(\alpha_i-\alpha_i'\right)$$

where $\alpha_i$ is the first Lagrange multiplier and $\alpha_i'$ is the second Lagrange multiplier. Solving for $\alpha_i, \alpha_i'$ must satisfy the KKT conditions, which give:

$$\sum_i\left(\alpha_i - \alpha_i'\right) = 0,\qquad 0 \le \alpha_i, \alpha_i' \le C\, w_i^{S_k}$$

The model weight parameter q and bias parameter b are found as:

$$q = \sum_i \left(\alpha_i - \alpha_i'\right)\varphi(\hat x_i),\qquad b = \hat y_j^{S_k} - \sum_i \left(\alpha_i-\alpha_i'\right)K(\hat x_i, \hat x_j) - \varepsilon \quad \left(\text{for some } 0 < \alpha_j < C\, w_j^{S_k}\right)$$

finally yielding the regression prediction model:

$$f(\hat x) = \sum_{i=1}^{n_{S_k}} \left(\alpha_i - \alpha_i'\right)K(\hat x_i, \hat x) + b$$
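The w-SVR above scales each sample's slack penalty by its weight $w_i^{S_k}$. As a compact runnable analogue (explicitly not the patent's exact model), the sketch below uses weighted kernel ridge regression with the same RBF kernel: a squared loss stands in for the ε-insensitive loss, and the per-sample weights scale each sample's error term. All names are illustrative.

```python
import numpy as np

def fit_weighted_krr(X, y, w, sigma2=1.0, lam=1e-3):
    """Weighted kernel ridge regression with an RBF kernel.

    Stand-in for the weight-based SVR: sample weights w scale each
    sample's loss, and the RBF kernel plays the same role as in w-SVR.
    Returns a predict(Xq) closure.
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    w = np.asarray(w, float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma2))
    W = np.diag(w)
    # minimise sum_i w_i * (y_i - (K a)_i)^2 + lam * a' K a
    # stationarity gives (W K + lam I) a = W y
    alpha = np.linalg.solve(W @ K + lam * np.eye(len(X)), W @ y)

    def predict(Xq):
        Xq = np.asarray(Xq, float)
        d2q = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2q / (2.0 * sigma2)) @ alpha

    return predict
```

A dual SVR solver with box constraints $0 \le \alpha_i, \alpha_i' \le C w_i$ would reproduce the text's model exactly; the weighted ridge variant keeps the example short while preserving the kernel and the per-sample weighting.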
Preferably, in step 3 the site weights of each province $S_k$ in the source data D and of the target province are initialized as:

$$w_i^{S_k} = \frac{1}{n_{S_k}},\qquad w_i^{S_T} = \frac{1}{n_T}$$

where $n_T$ is the number of sites in the target province.

The site weights of province $S_k$ are normalized to obtain the normalized weight vector of the source data province $S_k$; the site weights of the source data in the weighted multi-source TrAdaBoost algorithm are:

$$\hat w^{S_k} = \left(\hat w_1^{S_k}, \ldots, \hat w_{n_{S_k}}^{S_k}\right)$$

The site weights of the target province $S_T$ are normalized to obtain the normalized weight vector of the target province; the site weights of the target province in the weighted multi-source TrAdaBoost algorithm are:

$$\hat w^{S_T} = \left(\hat w_1^{S_T}, \ldots, \hat w_{n_T}^{S_T}\right)$$

The merged training data sets are:

$$D_k = \hat D_{S_k} \cup \hat D_T,\qquad Y_k = \hat Y_{S_k} \cup \hat Y_T$$

where $\hat D_{S_k}$ is the normalized source data of province $S_k$ in the normalized source data set $\hat D$ of step 1, N is the number of provinces, and each element of $\hat D_{S_k}$ is a site's normalized attribute, $n_{S_k}$ being the number of sites in province $S_k$; $\hat Y_{S_k}$ is the normalized service quantity of province $S_k$ in the normalized service-quantity data set $\hat Y$ of step 1, each element being a site's normalized service quantity; $\hat D_T$ is the normalized target training set of step 1, each element being a normalized attribute of the target province $S_T$, with $n_T$ sites in the target province; and $\hat Y_T$ is the normalized target service-quantity training set of step 1, each element being the normalized service quantity of a target-province site.
Preferably, in step 4 the merged training data sets $D_k, Y_k$ and the site weights $\hat w^{S_k}$ of the source data and $\hat w^{S_T}$ of the target province in the weighted multi-source TrAdaBoost algorithm are used to build, through step 2, the set of weight-based SVR models:

$$\left\{h_t^{S_1}, h_t^{S_2}, \ldots, h_t^{S_N}\right\}$$

where $h_t^{S_k}$ is the k-th weight-based SVR model, for province $S_k$, in the t-th iteration; N is the number of source data sets, i.e. the number of provinces; and each $h_t^{S_k}$ is determined by the first and second Lagrange multipliers $\alpha_{t,i}^{S_k}, \alpha_{t,i}'^{S_k}$, the bias parameter $b_t^{S_k}$, and the radial basis kernel $K(\hat x_i^{S_k}, \cdot)$ of site i of province $S_k$ in the t-th iteration.

The error of each prediction model $h_t^{S_k}$ on the normalized target training set $\hat D_T$ and the normalized target service-quantity training set $\hat Y_T$ in the t-th iteration is computed:

$$e_t^{S_k} = \sum_{i=1}^{n_T} \hat w_{t,i}^{S_T} \left| h_t^{S_k}(\hat x_i^{S_T}) - \hat y_i^{S_T} \right|$$

where $\hat w_{t,i}^{S_T}$ is the normalized weight of site i of the target province $S_T$ in the t-th iteration, $h_t^{S_k}(\hat x_i^{S_T})$ is the predicted service quantity of site i of the target province, and $\hat y_i^{S_T}$ is its actual service quantity. The weight of each prediction model $h_t^{S_k}$ is updated according to its error $e_t^{S_k}$.

Finally, the candidate prediction model $h_t$ of the t-th iteration is obtained by combining the per-province models according to these model weights.

Meanwhile, the error of the candidate prediction model $h_t$ on the target data $D_T, Y_T$ is computed, $w_{t,i}$ being the weight of each target-province data site.

The parameter $\phi_t$ for updating the sample weights is set as:

$$\phi_t = \frac{\varepsilon_t}{1 - \varepsilon_t}$$

where $\varepsilon_t$ is the error of the model obtained in the t-th iteration, used to update the weights of the target data samples:

$$\varepsilon_t = \frac{\sum_{i=1}^{n_T} \hat w_{t,i}^{S_T} \left| h_t(\hat x_i^{S_T}) - \hat y_i^{S_T} \right|}{\sum_{i=1}^{n_T} \hat w_{t,i}^{S_T}},\qquad \hat w_{t+1,i}^{S_T} = \hat w_{t,i}^{S_T}\, \phi_t^{-\mathbb{1}\left[\left|h_t(\hat x_i^{S_T}) - \hat y_i^{S_T}\right| > \varepsilon\right]}$$

where $\hat w_{t,i}^{S_T}$ is the weight of site i of the target province $S_T$ in the t-th iteration, $h_t(\hat x_i^{S_T})$ is the obtained predicted service quantity of site i of the target province, $\hat y_i^{S_T}$ is its service quantity, i.e. actual value, ε is the insensitive loss value, and $n_T$ is the number of sites in the target province.

The weights of the source data samples of each region are updated:

$$\hat w_{t+1,i}^{S_k} = \hat w_{t,i}^{S_k}\, \beta^{\mathbb{1}\left[\left|h_t(\hat x_i^{S_k}) - y_i^{S_k}\right| > \varepsilon\right]}$$

where $\hat w_{t,i}^{S_k}$ is the weight of site i of source data province $S_k$ in the t-th iteration, $h_t(\hat x_i^{S_k})$ is the predicted site service quantity of the t-th iteration, $y_i^{S_k}$ is the actual site service quantity, ε is the insensitive loss value, $n_{S_k}$ is the number of sites in the province, and the parameter β is:

$$\beta = \frac{1}{1 + \sqrt{2\ln\!\left(\sum_{k=1}^{N} n_{S_k}\right)\Big/ M}}$$

where M is the maximum number of iterations, t is the current iteration number, $t \in [1, M]$, and, per the source data D of step 1, $\sum_{k=1}^{N} n_{S_k}$ is the total number of sites over all source provinces;
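The two weight updates above can be sketched as a single function. Since the update exponents appear only as images in the original, this follows the classical TrAdaBoost formulation of Dai et al. (badly predicted source samples down-weighted by β, badly predicted target samples up-weighted via $\phi_t$); the function and variable names are illustrative.

```python
import math

def update_weights(src_w, src_err, tgt_w, tgt_err, M, n_src_total):
    """One TrAdaBoost-style weight update (Dai et al. form).

    src_w / tgt_w : current sample weights for source and target sites
    src_err / tgt_err : per-sample errors already thresholded to {0, 1}
        by the epsilon-insensitive rule |prediction - actual| > epsilon
    M : maximum iterations; n_src_total : total number of source samples
    """
    # source samples the current model predicts badly are down-weighted
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_src_total) / M))
    src_w = [w * beta ** e for w, e in zip(src_w, src_err)]
    # target samples the model gets wrong are up-weighted (AdaBoost-style)
    eps_t = sum(w * e for w, e in zip(tgt_w, tgt_err)) / sum(tgt_w)
    eps_t = min(max(eps_t, 1e-12), 0.499)   # keep phi in (0, 1)
    phi = eps_t / (1.0 - eps_t)
    tgt_w = [w * phi ** (-e) for w, e in zip(tgt_w, tgt_err)]
    return src_w, tgt_w, phi
```

Across iterations this shifts mass away from source sites that mislead the target prediction while concentrating target mass on the hard sites.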
Preferably, in step 5 step 4 is repeated up to the maximum number of iterations M, and when t = M the final prediction model f(x) is computed:

$$f(x) = \frac{\sum_{t=1}^{M} \ln\!\left(1/\phi_t\right) h_t(x)}{\sum_{t=1}^{M} \ln\!\left(1/\phi_t\right)}$$

where $\phi_t$ is the parameter value generated in each iteration and $h_t(x)$ is the model generated in each iteration;
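The combination of step 5 can be sketched as a φ-weighted model average. The patent's exact combination formula survives only as an image, so the weights $\ln(1/\phi_t)$, standard in AdaBoost-style ensembles, are an assumption here; names are illustrative.

```python
import math

def final_prediction(models, phis, x):
    """Combine per-iteration models h_t using weights ln(1/phi_t).

    models : list of callables h_t(x); phis : list of phi_t in (0, 1).
    Smaller phi_t (lower iteration error) means larger weight.
    """
    ws = [math.log(1.0 / p) for p in phis]
    total = sum(ws)
    return sum(w * h(x) for w, h in zip(ws, models)) / total
```

Two equally weighted constant models illustrate the averaging: equal $\phi_t$ values yield the plain mean of the model outputs.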
Preferably, in step 6, for the site attribute, i.e. feature vector, $\hat x_i^{S_T}$ of site i of the target province $S_T$, the model prediction is $\hat y_i = f(\hat x_i^{S_T})$, and the predicted value is denormalized:

$$y_i = \hat y_i\left(\max(Y) - \min(Y)\right) + \min(Y)$$

where min takes the set minimum and max takes the set maximum.
Compared with the prior art, the present invention saves data resources and improves data quality.
Description of the drawings
Fig. 1: Flow chart of the method of the present invention.
Specific implementation mode
To help those of ordinary skill in the art understand and implement the present invention, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the implementation examples described here are only used to illustrate and explain the present invention and are not intended to limit it.
The specific steps of the embodiment of the present invention are introduced with reference to Fig. 1. The present invention provides a data verification method based on multi-source transfer learning, whose specific steps are:
Step 1: Obtain from the system data tables the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, and the system the site belongs to, together with the site centrality computed by the PageRank algorithm, to construct the site attributes; further construct a source data set from the site attributes of each site in each province and normalize it; construct a target training set from the site attributes of the province to be predicted and normalize it; and extract the site service quantities corresponding to the source data set and the target training set and normalize them;
In step 1 the site attribute, i.e. the feature vector, of a site is:

$$x_m^{S_k} = \left(x_{m,1}^{S_k}, x_{m,2}^{S_k}, x_{m,3}^{S_k}, x_{m,4}^{S_k}, x_{m,5}^{S_k}, x_{m,6}^{S_k}, x_{m,7}^{S_k}\right)$$

where $x_m^{S_k}$ is the site attribute of site m in province $S_k$, N = 10 is the number of provinces, $n_{S_k}$ is the number of sites in province $S_k$, and the seven components are the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, the system the site belongs to, and the site centrality of site m in province $S_k$.

The site type, voltage class, dispatch grade, construction year, optical transmission equipment quantity, and owning system can be obtained from the system data tables. The site centrality of site m in province $S_k$ is first initialized according to the site's degree and the number of sites:

$$ctr_m^{S_k}(0) = \frac{d_m^{S_k}}{n_{S_k}}$$

where $ctr_m^{S_k}$ is the centrality of site m in province $S_k$, $n_{S_k}$ is the number of sites, and $d_m^{S_k}$ is the degree of site m. The centrality is then updated by PageRank iterations with the following formula until it stabilizes:

$$ctr_m^{S_k}(iter+1) = \frac{1-\alpha}{n_{S_k}} + \alpha \sum_{j \in In(m)} \frac{ctr_j^{S_k}(iter)}{d_j^{S_k}}$$

where iter is the PageRank iteration number, NI = 500 is the total number of PageRank iterations, In(m) is the set of sites in province $S_k$ connected to site m by optical cable, $ctr_j^{S_k}$ is the centrality of the j-th connected site, $d_j^{S_k}$ is that site's number of outgoing cable connections, and α = 0.85 is the damping coefficient;
The source data set is constructed from the site attributes of each province with larger data volume:

$$D = \{D_{S_1}, D_{S_2}, \ldots, D_{S_N}\}$$

where N = 10 is the number of provinces with larger data volume and $D_{S_k}$ is the k-th source data set, i.e. province $S_k$, containing $n_{S_k}$ samples, i.e. $n_{S_k}$ sites:

$$D_{S_k} = \{x_1^{S_k}, x_2^{S_k}, \ldots, x_{n_{S_k}}^{S_k}\}$$

where $n_{S_k}$ is the number of sites, i.e. samples, in province $S_k$, $S_N$ is the number of provinces, and $x_m^{S_k}$ is the site attribute of site m in province $S_k$, whose components are the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, the system the site belongs to, and the site centrality.

The target training set is constructed from the site attributes of the predicted province $S_T$:

$$D_T = \{x_1^{S_T}, x_2^{S_T}, \ldots, x_{n_T}^{S_T}\}$$

where $n_T$, the number of samples in the target training set, is the number of sites in the predicted province $S_T$, and $x_i^{S_T}$ ($i \in [1, n_T]$) is the site attribute, i.e. feature vector, of site i, with the same seven components for the predicted province $S_T$;
The source data set D and the target training set $D_T$ are respectively discretized and normalized to obtain the normalized source data set $\hat D$ and the normalized target training set $\hat D_T$.

The site service quantities corresponding to each province $S_k$ in the source data set D are counted to obtain the service-quantity data set:

$$Y = \{Y_{S_1}, Y_{S_2}, \ldots, Y_{S_N}\},\qquad Y_{S_k} = \{y_1^{S_k}, \ldots, y_{n_{S_k}}^{S_k}\}$$

where $S_k \in [1, S_N]$ and $n_{S_k}$ is the number of sites in province $S_k$.

The site service quantities corresponding to the target training set $D_T$, i.e. province $S_T$, are counted to obtain the target service-quantity training set:

$$Y_T = \{y_1^{S_T}, \ldots, y_{n_T}^{S_T}\}$$

where $n_T$ is the number of sites in province $S_T$.

The service-quantity data set Y and the target service-quantity training set $Y_T$ are normalized with min-max standardization:

$$\hat y = \frac{y - \min(Y)}{\max(Y) - \min(Y)}$$

where min takes the set minimum, max takes the set maximum, and y is the service quantity of any province's site in Y or $Y_T$. After min-max standardization, the normalized service-quantity data set $\hat Y$ and the normalized target service-quantity training set $\hat Y_T$ are obtained.
Step 2: Build a weight-based SVR model from the transfer-learning SVR model and a radial basis function;
SVR models based on weight described in step 2 are by understanding that normalizing set of source data is described in step 1:
SkA normalization source data, that is, province SkIncludingA sample isA website:
It is according to normalization set of source data structure training dataset:
Wherein, SNFor the quantity of the quantity, that is, sample in province,For province SkQuantity, that is, training dataset of websiteSize,For training datasetMiddle province SkThe normalization number of services of website i,For training number According to collectionMiddle province SkNormalization website attribute, that is, normalization characteristic vector of website i is:
where the components of the normalized feature vector of site m of province S_k are, in order: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality;
Each sample (site) in the S_k-th normalized source data is assigned a normalization attribute weight w_i. The weight-based w-SVR model is then:
where q is the weight parameter of the model and b is its bias parameter;
The parameters of the weight-based w-SVR model are solved as follows.
Define the linear ε-insensitive loss function:
where ε = 1/e is the insensitive loss value: when the difference between the normalized business quantity of site i of province S_k and the prediction of the regression estimate function is smaller than ε, the loss equals 0;
The invention selects the radial basis kernel K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)), which nonlinearly maps the training data set into another feature space; the regression estimate function is constructed in the feature space obtained by this transformation, and the weights of the S_k-th normalized source data are initialized. In the kernel formula,
σ² is the variance of the training data set;
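The radial basis kernel can be written out directly. This NumPy sketch (variable names ours) uses the training-set variance as σ², as the text specifies:

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma2):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma2))."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.exp(-diff.dot(diff) / (2.0 * sigma2)))

X = np.array([[0.2, 0.5, 0.1],
              [0.8, 0.4, 0.9]])            # two normalized site feature vectors
sigma2 = X.var()                           # sigma^2 = variance of the training data
k_same = rbf_kernel(X[0], X[0], sigma2)    # identical inputs: K = 1
k_diff = rbf_kernel(X[0], X[1], sigma2)    # distinct inputs: 0 < K < 1
```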
A weight coefficient is introduced into the SVR model to control the influence of singular variance, giving the optimization objective min ½‖q‖² + C Σ_i w_i (ξ_i + ξ'_i), subject to the ε-insensitive constraints y_i − q·φ(x_i) − b ≤ ε + ξ_i, q·φ(x_i) + b − y_i ≤ ε + ξ'_i, and ξ_i, ξ'_i ≥ 0,
where ξ_i is the first slack variable, ξ'_i the second slack variable, ε = 1/e the insensitive loss value, C a model parameter, q the weight parameter of the model, and b its bias parameter. Applying the Lagrangian transformation and passing to the dual, the optimization problem becomes:
where α_i is the first Lagrange multiplier and α'_i the second Lagrange multiplier; the solution for α_i and α'_i must also satisfy the KKT conditions, hence:
The model weight parameter q and bias parameter b are then recovered:
where the multipliers of the support vectors satisfy the box constraints 0 < α_i, α'_i < C·w_i; the regression prediction model f(x) = Σ_i (α_i − α'_i) K(x_i, x) + b is finally obtained.
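The effect of the per-site weights on the SVR objective can be approximated off the shelf: scikit-learn's SVR accepts per-sample weights that scale each sample's slack penalty, mirroring the C·w_i box constraints above. This is a stand-in sketch on synthetic data, not the patent's own w-SVR solver:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((40, 7))              # 40 sites x 7 normalized site attributes
y = X.sum(axis=1) / 7.0              # synthetic normalized business quantities
w = np.full(40, 1.0 / 40)            # uniform initial site weights (Step 3)

model = SVR(kernel="rbf", C=10.0, epsilon=1.0 / np.e)  # epsilon = 1/e, as in the text
model.fit(X, y, sample_weight=w)     # w_i scales sample i's slack penalty
pred = model.predict(X)
```

As the TrAdaBoost iterations reweight the sites, refitting with the updated `sample_weight` reproduces the "weight-based SVR" behaviour.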
Step 3: initialize the weight of each site in the source data and in the target province, normalize these initial weights, initialize the site weights used by the weighted multi-source TrAdaBoost algorithm, and build the merged training set by concatenating the normalized source data set with the normalized target training set and the corresponding normalized business-quantity sets;
The site weights of each province S_k in the source data D (Step 3), and of the target province, are initialized as an equal share per site:
where the normalizing count for the target province is its site count n_T;
Normalizing the site weights of province S_k yields the normalized weight vector of source-data province S_k; these are the source-data site weights used in the weighted multi-source TrAdaBoost algorithm:
Likewise, normalizing the site weights of the target province S_T yields the target province's normalized weight vector; these are the target-province site weights used in the weighted multi-source TrAdaBoost algorithm:
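The initialization and normalization of the site weights can be sketched as follows (the equal 1/n split per province is our reading of the formulas, which are images in the original):

```python
import numpy as np

def init_site_weights(source_site_counts, n_target):
    """Give every site of a province an equal initial weight, then normalize so
    each province's weight vector (and the target's) sums to 1, as required by
    the weighted multi-source TrAdaBoost algorithm."""
    source = [np.full(n_k, 1.0 / n_k) for n_k in source_site_counts]
    target = np.full(n_target, 1.0 / n_target)
    return source, target

src_w, tgt_w = init_site_weights([30, 45, 25], n_target=20)  # hypothetical counts
```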
The merged training data set is then formed:
where the first component is the normalized source data of province S_k from the normalized source data set of Step 1, with N the number of provinces, each of its elements being the normalized attribute vector of one site of province S_k;
where the second component is the normalized business-quantity data of province S_k from the normalized business-quantity data set of Step 1 (N = 10 provinces in this embodiment), each element being the normalized business quantity of one site of province S_k;
where the third component is the normalized target training set of Step 1, each element being the normalized attribute vector of one site of the target province S_T;
and where the fourth component is the normalized target business-quantity training set of Step 1, each element being the normalized business quantity of one site of the target province.
Step 4: from the merged training set and the normalized weight vectors, build the prediction model via Step 2 and compute the model error parameters;
Specifically, the merged training data sets D_k, Y_k, the source-data site weights of the weighted multi-source TrAdaBoost algorithm, and the target-province site weights are passed through Step 2 to build the set of weight-based SVR models:
where the k-th model is the weight-based SVR model of province S_k at iteration t, N is the number of source data sets (provinces), and the remaining symbols are, for site i of province S_k at iteration t: the first Lagrange multiplier, the second Lagrange multiplier, the bias parameter, and the radial basis kernel;
The error of the prediction model on the normalized target training set and the normalized target business-quantity training set at iteration t is computed:
where, at iteration t, each site i of the target province S_T enters with its normalized weight, its predicted business quantity, and its actual business quantity; according to this error the prediction-model weights are updated:
Finally, the candidate prediction model h_t of iteration t is obtained:
The parameter φ_t used to update the sample weights is set:
where ε_t is the error of the model obtained at iteration t; the weights of the target data samples are then updated:
where the update uses, for each site i of the target province S_T at iteration t, its weight, its predicted business quantity, and its actual business quantity; ε = 1/e is the insensitive loss value and n_T is the site count of the target province;
The weight of each source data sample of every region is updated:
where the update uses, for each site i of source province S_k at iteration t, its weight, its predicted site business quantity, and its actual site business quantity; ε = 1/e is the insensitive loss value and the site count of the province enters the normalization. The parameter β is:
where M = 200 is the maximum number of iterations, t ∈ [1, M] is the current iteration, and, following the source data of Step 1, n is the total number of sites summed over all provinces;
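The weight-update formulas of Step 4 are images in the original; the sketch below follows the standard TrAdaBoost scheme (target samples with large error are boosted, source samples with large error are damped), which is consistent with the surrounding description but is an assumption in its exact form:

```python
import numpy as np

def tradaboost_update(w_src, w_tgt, err_src, err_tgt, eps_t, M, n_src_total):
    """One TrAdaBoost-style weight update. phi_t raises hard target samples;
    beta lowers mismatched source samples; weights are renormalized jointly."""
    phi_t = eps_t / (1.0 - eps_t)                      # target-weight factor
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src_total) / M))
    w_tgt = w_tgt * phi_t ** (-np.abs(err_tgt))        # boost hard target samples
    w_src = w_src * beta ** np.abs(err_src)            # damp mismatched source samples
    total = w_src.sum() + w_tgt.sum()
    return w_src / total, w_tgt / total

rng = np.random.default_rng(1)
w_src = np.full(100, 1.0 / 120)                        # hypothetical source sites
w_tgt = np.full(20, 1.0 / 120)                         # hypothetical target sites
w_src, w_tgt = tradaboost_update(w_src, w_tgt, rng.random(100), rng.random(20),
                                 eps_t=0.2, M=200, n_src_total=100)
```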
Step 5: repeat Step 4 up to the maximum number of iterations and compute the final prediction model;
Concretely, Step 4 is repeated until the maximum number of iterations is reached:
if t = M (M = 200), the final prediction model f(x) is computed:
where φ_t is the parameter value produced in each iteration and h_t(x) is the model produced in each iteration;
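The combination formula for f(x) is an image in the original; TrAdaBoost-style regressors conventionally combine the per-iteration models h_t by the φ_t-weighted median over the later iterations, so this is a hedged sketch of that convention, not the patent's confirmed rule:

```python
import numpy as np

def final_prediction(models, phis, x, use_last_half=True):
    """Combine per-iteration models h_t by the phi_t-weighted median (the usual
    TrAdaBoost.R2 choice). Models with smaller phi_t, i.e. smaller error,
    get a larger say in the median."""
    start = len(models) // 2 if use_last_half else 0
    preds = np.array([h(x) for h in models[start:]])
    weights = np.log(1.0 / np.array(phis[start:]))   # smaller phi_t -> larger weight
    order = np.argsort(preds)                        # weighted median of preds
    cum = np.cumsum(weights[order])
    idx = order[np.searchsorted(cum, 0.5 * cum[-1])]
    return preds[idx]

models = [lambda x, c=c: c for c in (0.2, 0.4, 0.6, 0.8)]  # stand-in h_t's
phis = [0.5, 0.4, 0.3, 0.2]                                # stand-in phi_t's
y_hat = final_prediction(models, phis, x=None)
```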
Step 6: predict on the site attributes of the target province with the final prediction model to obtain the predicted site business quantities of the target province, and denormalize the predicted quantities.
For the site attribute (feature) vector of site i of the target province S_T (Step 6),
the model prediction is denormalized by inverting the min-max transform, y = y'·(max − min) + min,
where min and max are the minimum and maximum of the original set.
It should be understood that the above description of the preferred embodiment is relatively detailed and should therefore not be taken as limiting the scope of patent protection. Under the teaching of the present invention, and without departing from the scope protected by the claims, those skilled in the art may make substitutions or variations, all of which fall within the protection scope of the present invention; the claimed scope of the invention is determined by the appended claims.

Claims (7)

1. A data verification method based on multi-source transfer learning, characterized by comprising the following steps:
Step 1: obtain from the system data tables the site type, site voltage class, site dispatch level, site construction age, number of optical transmission devices in the site, and the system the site belongs to, together with the site centrality computed by the PageRank algorithm, to form the site attributes; build the source data set from the site attributes of every site of each province and normalize it; build the target training set from the site attributes of the predicted province and normalize it; extract the site business quantities corresponding to the source data set and the target training set and normalize them;
Step 2: build the weight-based SVR model from the SVR transfer-learning model and a radial basis kernel function;
Step 3: initialize the weight of each site in the source data and in the target province, normalize these initial weights, initialize the site weights used by the weighted multi-source TrAdaBoost algorithm, and build the merged training set by concatenating the normalized source data set with the normalized target training set and the corresponding normalized business-quantity sets;
Step 4: from the merged training set and the normalized weight vectors, build the prediction model via Step 2 and compute the model error parameters;
Step 5: repeat Step 4 up to the maximum number of iterations and compute the final prediction model;
Step 6: predict on the site attributes of the target province with the final prediction model to obtain the predicted site business quantities of the target province, and denormalize the predicted site business quantities.
2. The data verification method based on multi-source transfer learning according to claim 1, characterized in that the site attribute, i.e. feature vector, of Step 1 is:
where the vector of site m of province S_k, with S_k ∈ [1, S_N] and S_N the number of provinces, comprises in order: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality;
the site type, site voltage class, site dispatch level, site construction age, number of optical transmission devices, and parent system can be read from the system data tables; the site centrality of site m of province S_k is first initialized from the site's degree and the province's site count:
where the initialization uses the site count of province S_k and the degree of site m; the centrality is then updated by PageRank iterations until it stabilizes, using the following formula:
where iter is the PageRank iteration index, NI = 500 is the total number of PageRank iterations, the sum runs over the set of sites of province S_k connected to site m by optical cable, each connected site j contributes its centrality divided by its number of outgoing cable connections, and α is the damping coefficient;
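The centrality iteration of claim 2 can be sketched as a PageRank pass over the optical-cable graph. Uniform initialization and α = 0.85 are our assumptions, since the claim's formulas are images:

```python
import numpy as np

def site_centrality(adj, alpha=0.85, n_iter=500):
    """PageRank-style site centrality. adj[m] lists the sites cabled to site m;
    each neighbour j spreads its centrality over its own cable count."""
    n = len(adj)
    c = np.full(n, 1.0 / n)                    # initial centrality per site
    for _ in range(n_iter):
        nxt = np.full(n, (1.0 - alpha) / n)    # damping term
        for m, neighbours in enumerate(adj):
            for j in neighbours:
                nxt[m] += alpha * c[j] / len(adj[j])
        c = nxt
    return c

adj = [[1], [0, 2], [1]]        # 3 sites on a cable path: the middle site is central
c = site_centrality(adj, n_iter=100)
```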
The source data set is built from the site attributes of each province with a large data volume:
where N is the number of provinces with a large data volume, and the S_k-th source data, i.e. province S_k, contains one sample per site:
where the site count of province S_k is its number of samples, and the site attribute vector of site m of province S_k, with S_k ∈ [1, S_N] and S_N the number of provinces, comprises in order: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality;
the target training set is built from the site attributes of the predicted province S_T:
where n_T, the number of samples of the target training set, is the site count of the predicted province S_T, and the site attribute, i.e. feature vector, of site i (i ∈ [1, n_T]) of the predicted province S_T is:
with components, in order: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality of site i;
the source data set D and the target training set are discretized and normalized separately, yielding the normalized source data set and the normalized target training set;
the site business quantities corresponding to each province S_k in the source data set D are counted to obtain the business-quantity data set:
where S_k ∈ [1, S_N] and the count runs over the sites of province S_k;
the site business quantities of the target training set, i.e. of province S_T, are counted to obtain the target business-quantity training set:
where the count runs over the sites of province S_T;
the business-quantity data set Y and the target business-quantity training set are normalized with min-max standardization, y' = (y − min) / (max − min):
where min and max are the minimum and maximum of the set and y is any site business quantity in Y or in the target business-quantity training set; after min-max standardization these yield the normalized business-quantity data set and the normalized target business-quantity training set, respectively.
3. The data verification method based on multi-source transfer learning according to claim 1, characterized in that the weight-based SVR model of Step 2 is built as follows. From Step 1, the normalized source data set is known:
the S_k-th normalized source data, i.e. province S_k, contains one sample per site:
a training data set is constructed from the normalized source data set:
where S_N is the number of provinces (i.e. of source data sets), the size of the training set for province S_k equals its site count, and the training set holds, for each site i of province S_k, its normalized business quantity and its normalized site attributes, i.e. its normalized feature vector:
with components, in order, for site m of province S_k: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality;
each sample (site) in the S_k-th normalized source data is assigned a normalization attribute weight; the weight-based w-SVR model is then:
where q is the weight parameter of the model and b is its bias parameter;
the parameters of the weight-based w-SVR model are solved as follows:
define the linear ε-insensitive loss function:
where ε is the insensitive loss value: when the difference between the normalized business quantity of site i of province S_k and the prediction of the regression estimate function is smaller than ε, the loss equals 0;
the invention selects the radial basis kernel K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)), which nonlinearly maps the training data set into another feature space; the regression estimate function is constructed in the feature space obtained by this transformation, and the weights of the S_k-th normalized source data are initialized; in the kernel formula:
σ² is the variance of the training data set;
a weight coefficient is introduced into the SVR model to control the influence of singular variance, giving the optimization objective:
where ξ_i is the first slack variable, ξ'_i the second slack variable, ε the insensitive loss value, C a model parameter, q the weight parameter of the model, and b its bias parameter; applying the Lagrangian transformation and passing to the dual, the optimization problem becomes:
where α_i is the first Lagrange multiplier and α'_i the second Lagrange multiplier; the solution for α_i and α'_i must also satisfy the KKT conditions, hence:
the model weight parameter q and bias parameter b are recovered:
where 0 < α_i within the box constraints; the regression prediction model is finally obtained:
4. The data verification method based on multi-source transfer learning according to claim 1, characterized in that:
the site weights of each province S_k in the source data D (Step 3) are initialized as:
where the denominator is the sample count, i.e. site count, of province S_k; the site weights of the target province S_T are initialized as:
where n_T is the site count of the target province;
normalizing the site weights of province S_k yields the normalized weight vector of source-data province S_k; these are the source-data site weights used in the weighted multi-source TrAdaBoost algorithm:
likewise, normalizing the site weights of the target province S_T yields the target province's normalized weight vector; these are the target-province site weights used in the weighted multi-source TrAdaBoost algorithm:
the merged training data set is then formed:
where the first component is the normalized source data of province S_k from the normalized source data set of Step 1, with N the number of provinces, each of its elements being the normalized attribute vector of one site of province S_k;
where the second component is the normalized business-quantity data of province S_k from the normalized business-quantity data set of Step 1, each element being the normalized business quantity of one site of province S_k;
where the third component is the normalized target training set of Step 1, each element being the normalized attribute vector of one site of the target province S_T;
and where the fourth component is the normalized target business-quantity training set of Step 1, each element being the normalized business quantity of one site of the target province.
5. The data verification method based on multi-source transfer learning according to claim 1, characterized in that in Step 4 the merged training data sets D_k, Y_k, the source-data site weights of the weighted multi-source TrAdaBoost algorithm, and the target-province site weights are passed through Step 2 to build the set of weight-based SVR models:
where the k-th model is the weight-based SVR model of province S_k at iteration t, N is the number of source data sets (provinces), and the remaining symbols are, for site i of province S_k at iteration t: the first Lagrange multiplier, the second Lagrange multiplier, the bias parameter, and the radial basis kernel;
the error of the prediction model on the normalized target training set and the normalized target business-quantity training set at iteration t is computed:
where, at iteration t, each site i of the target province S_T enters with its normalized weight, its predicted business quantity, and its actual business quantity; according to this error the prediction-model weights are updated:
finally, the candidate prediction model h_t of iteration t is obtained:
meanwhile, the error of the candidate prediction model h_t on the target detection data D_T, Y_T is computed, with w_{t,i} the weight of the target-province data sites:
the parameter φ_t used to update the sample weights is set:
where ε_t is the error of the model obtained at iteration t; the weights of the target data samples are then updated:
where the update uses, for each site i of the target province S_T at iteration t, its weight, its predicted business quantity, and its actual business quantity; ε is the insensitive loss value and the denominator is the site count of the target province;
the weight of each source data sample of every region is updated:
where the update uses, for each site i of source province S_k at iteration t, its weight, its predicted site business quantity, and its actual site business quantity; ε is the insensitive loss value and the site count of the province enters the normalization; the parameter β is:
where M is the maximum number of iterations, t ∈ [1, M] is the current iteration, and, following the source data of Step 1, n is the total number of sites summed over all provinces.
6. The data verification method based on multi-source transfer learning according to claim 1, characterized in that in Step 5, Step 4 is repeated up to the maximum number of iterations and the final prediction model is computed as follows:
if t = M, the final prediction model f(x) is computed:
where φ_t is the parameter value produced in each iteration and h_t(x) is the model produced in each iteration.
7. The data verification method based on multi-source transfer learning according to claim 1, characterized in that in Step 6, for the site attribute, i.e. feature vector, of site i of the target province S_T:
the model prediction is denormalized by inverting the min-max transform, y = y'·(max − min) + min:
where min and max are the minimum and maximum of the original set.
CN201810320808.6A 2018-04-11 2018-04-11 Data verification method based on multi-source transfer learning Active CN108549907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810320808.6A CN108549907B (en) 2018-04-11 2018-04-11 Data verification method based on multi-source transfer learning


Publications (2)

Publication Number Publication Date
CN108549907A true CN108549907A (en) 2018-09-18
CN108549907B CN108549907B (en) 2021-11-16

Family

ID=63514421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810320808.6A Active CN108549907B (en) 2018-04-11 2018-04-11 Data verification method based on multi-source transfer learning

Country Status (1)

Country Link
CN (1) CN108549907B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110398986A (en) * 2019-04-28 2019-11-01 清华大学 A kind of intensive woods cognition technology of unmanned plane of multi-source data migration
CN110457646A (en) * 2019-06-26 2019-11-15 中国政法大学 One kind being based on parameter transfer learning low-resource head-position difficult labor personalized method
CN110674648A (en) * 2019-09-29 2020-01-10 厦门大学 Neural network machine translation model based on iterative bidirectional migration
WO2020168676A1 (en) * 2019-02-21 2020-08-27 烽火通信科技股份有限公司 Method for constructing network fault handling model, fault handling method and system
CN112651173A (en) * 2020-12-18 2021-04-13 浙江大学 Agricultural product quality nondestructive testing method based on cross-domain spectral information and generalizable system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100069942A (en) * 2008-12-17 2010-06-25 한양대학교 산학협력단 Method for cooperative transmitting data in wireless multihop network and system thereof
CN104199857A (en) * 2014-08-14 2014-12-10 西安交通大学 Tax document hierarchical classification method based on multi-tag classification
CN106296044A (en) * 2016-10-08 2017-01-04 南方电网科学研究院有限责任公司 power system risk scheduling method and system
CN106651188A (en) * 2016-12-27 2017-05-10 贵州电网有限责任公司贵阳供电局 Electric transmission and transformation device multi-source state assessment data processing method and application thereof
CN107818523A (en) * 2017-11-14 2018-03-20 国网江西省电力公司信息通信分公司 Power communication system data true value based on unstable frequency distribution and frequency factor study differentiates and estimating method



Also Published As

Publication number Publication date
CN108549907B (en) 2021-11-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant