CN108549907A - A data verification method based on multi-source transfer learning - Google Patents

A data verification method based on multi-source transfer learning

Info

Publication number
CN108549907A
CN108549907A (application CN201810320808.6A)
Authority
CN
China
Prior art keywords
website
province
target
normalization
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810320808.6A
Other languages
Chinese (zh)
Other versions
CN108549907B (en
Inventor
李石君
刘洋
杨济海
邓永康
余伟
余放
李宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201810320808.6A priority Critical patent/CN108549907B/en
Publication of CN108549907A publication Critical patent/CN108549907A/en
Application granted granted Critical
Publication of CN108549907B publication Critical patent/CN108549907B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/245Classification techniques relating to the decision surface

Abstract

The present invention proposes a data verification method based on multi-source transfer learning. The method extracts the site service quantities corresponding to the source data set and the target training set and normalizes them; builds a weight-based SVR model from the transfer-learning SVR model and a radial basis function; initializes and normalizes the site weights of the source data and the target province, and obtains a merged training set by merging the normalized source data set, the normalized target training set, the normalized service-quantity training set, and the normalized service quantities; establishes a prediction model from the merged training set and normalized vectors and computes the model error parameters; iterates repeatedly and computes the final prediction model; obtains the predicted site service quantities of the target province with the final prediction model and denormalizes them. Compared with the prior art, the present invention improves data quality and saves data resources.

Description

A data verification method based on multi-source transfer learning
Technical field
The invention belongs to the field of transfer learning, and more particularly relates to a data verification method based on multi-source transfer learning.
Background technology
As the second physical network of the State Grid power company, the telecommunication management system (TMS) carries the core business of power grid operation and management and is an important guarantee of the safe, stable, and economical operation of the grid. As the core management and control system of the power company's communication specialty, TMS has played a major role in resource management, real-time monitoring, and operation management, and has accumulated massive amounts of data in the process. TMS data are preserved in databases, with each unit independently deploying its own database server. The data mainly comprise: TMS resource data, alarm data, work-order data, and data generated by internal modules; data exchanged between the State Grid communication company and its branches, provincial companies, and municipalities, such as resource reporting between superior and subordinate systems, trouble-ticket dispatch, statistical reports, task dispatch, and alarm reports; and data flows with peer external systems, such as ledger data and workflows. However, data quality problems in TMS severely affect data analysis and decision-making in actual production. They manifest mainly in three aspects: static resource data that do not match reality, incorrectly associated dynamic resource data, and basic data that are not kept up to date. These problems undermine TMS's role in providing strong support for lean management of power communication. At the same time, data volumes differ greatly across provinces in TMS: provincial companies with smaller networks hold 1G~2G of data, large units such as the State Grid communication company reach 30G~40G, while for some particular services remote regions hold only a few hundred KB. Such small data sets are far from sufficient to train a good conventional machine learning model.
Data quality problems such as missing, erroneous, and outdated data have always been an important topic in big data analysis, bringing huge losses to society every year. According to a survey by a German data analysis institution, the United States loses up to 600 billion dollars a year because of bad data, and medical malpractice caused by data errors kills 98,000 patients in the United States every year. In TMS, power business is managed at low frequency: service management data consist mostly of monthly statement data, with no daily (or higher-frequency) management of service progress and state. Furthermore, the entry and maintenance of business process data lag behind the business process itself, so a large amount of data that does not match reality is produced. This seriously affects the company's judgments and decisions about the business in actual production, so data quality itself must be examined before any data analysis. The present invention predicts site service quantities to judge whether service quantities are missing in the station system, and thereby finds abnormal sites. These data differ greatly from province to province. For provinces with sufficient data, traditional machine learning methods such as support vector regression and neural networks can achieve good results, but traditional machine learning requires training data and test data to follow the same distribution, so the data of all provinces cannot simply be pooled for training. Training therefore goes wrong in regions with little data: forcing a model onto one region's data yields a bad model because the data are insufficient, while pooling all provinces' data degrades the model because the data sets are differently distributed. On this basis, the present invention proposes to use the data of other provinces to train on the target data through transfer learning, achieving the goal of abnormal site detection.
Transfer learning is a new field of machine learning whose purpose is to use existing knowledge to learn in different but related domains. It relaxes two basic assumptions of conventional machine learning: that training and test data are independent and identically distributed, and that enough data are available to train a good model. Studies show that the more similar two domains are, the easier transfer learning is and the better it works; otherwise the effect is often poor, and "negative transfer" may even occur. Domain adaptation is an active research direction in transfer learning. Pan et al. proposed the TCA (Transfer Component Analysis) algorithm for domain adaptation. TCA is a feature-based transfer learning method: when the source and target domains follow different data distributions, it maps the data of both domains into a high-dimensional reproducing kernel Hilbert space in which the distance between source and target data is minimized while their respective internal properties are retained to the greatest extent. However, TCA considers only the correlation of target and source data in another space, which is too limited, and its time complexity is relatively high. Dai et al. proposed the instance-based TrAdaBoost (Transfer AdaBoost) algorithm, whose idea is to make maximal use of the source data: find the data in the source that are relevant to the target data, then train on them together with the target data. But TrAdaBoost uses only a single source data set; the result depends on the correlation between source and target data, and the algorithm's correctness is proportional to that correlation, so if the correlation is very weak, negative transfer easily arises. By considering the correlations between multiple sources and the target, Yao et al. proposed two multi-source transfer learning algorithms, MTrA (MultiSource-TrAdaBoost) and TTrA (Task-TrAdaBoost). MTrA assumes multiple source data sets and, in each iteration, trains a weak classifier using the source most correlated with the target data, then combines the weak classifiers into a strong classifier. TTrA trains one weak classifier per source in each iteration, selects the classifier with minimum error on the target data, and after all iterations combines the selected classifiers into a strong classifier. Both multi-source algorithms select, in each iteration, the source most correlated with the target. Although this guarantees that the migrated source data are most relevant to the target, they make no use of the information in the other sources; in actual production each data source is very expensive, and such an approach wastes a large amount of company resources. Data quality problems in TMS have seriously affected the company's judgment and operation of its actual business, and the differences in distribution and volume across regional data further challenge the discovery of data quality problems.
Invention content
To solve the above problems, the present invention proposes a data verification method based on multi-source transfer learning. The technical solution adopted by the present invention is:
Step 1: Obtain from the system data tables the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, and the system the site belongs to, together with the site centrality computed by the PageRank algorithm, to construct the site attributes; further construct a source data set from the site attributes of each site in each province and normalize it; construct a target training set from the site attributes of the province to be predicted and normalize it; and extract the site service quantities corresponding to the source data set and the target training set and normalize them;
Step 2: Build a weight-based SVR model from the transfer-learning SVR model and a radial basis function;
Step 3: Initialize the weights of each site in the source data and the target province, normalize the initialized weights, and initialize the site weights of the source data and the target province in the weighted multi-source TrAdaBoost algorithm; obtain a merged training set by merging the normalized source data set, the normalized target training set, the normalized service-quantity training set, and the normalized service quantities;
Step 4: Establish a prediction model from the merged training set and the normalized vectors through step 2, and compute the model error parameters;
Step 5: Repeat step 4 up to the maximum number of iterations and compute the final prediction model;
Step 6: Predict on the site attributes of the target province with the final prediction model to obtain the predicted site service quantities of the target province, and denormalize the predicted site service quantities.
Preferably, the site attribute, i.e. the feature vector, of a site in step 1 is:

$$x_m^{S_k} = \left(x_{m,1}^{S_k}, x_{m,2}^{S_k}, x_{m,3}^{S_k}, x_{m,4}^{S_k}, x_{m,5}^{S_k}, x_{m,6}^{S_k}, x_{m,7}^{S_k}\right)$$

where $x_m^{S_k}$ is the site attribute of site m in province $S_k$, N is the number of provinces, $n_{S_k}$ is the number of sites in province $S_k$, and the components of $x_m^{S_k}$ are: $x_{m,1}^{S_k}$ the site type, $x_{m,2}^{S_k}$ the site voltage class, $x_{m,3}^{S_k}$ the site dispatch grade, $x_{m,4}^{S_k}$ the site construction year, $x_{m,5}^{S_k}$ the quantity of optical transmission equipment in the site, $x_{m,6}^{S_k}$ the system the site belongs to, and $x_{m,7}^{S_k}$ the site centrality of site m in province $S_k$;
The site type, voltage class, dispatch grade, construction year, optical transmission equipment quantity, and owning system can be obtained from the system data tables. The site centrality of site m in province $S_k$ is first initialized according to the site's degree and the number of sites:

$$ctr_m^{S_k}(0) = \frac{d_m^{S_k}}{n_{S_k}}$$

where $ctr_m^{S_k}$ is the centrality of site m in province $S_k$, $n_{S_k}$ is the number of sites in province $S_k$, and $d_m^{S_k}$ is the degree of site m. The centrality is then updated by PageRank iterations with the following formula until it stabilizes:

$$ctr_m^{S_k}(iter+1) = \frac{1-\alpha}{n_{S_k}} + \alpha \sum_{j \in In(m)} \frac{ctr_j^{S_k}(iter)}{d_j^{S_k}}$$

where iter is the PageRank iteration number, NI = 500 is the total number of PageRank iterations, In(m) is the set of sites in province $S_k$ connected to site m by optical cable, $ctr_j^{S_k}$ is the centrality of the j-th connected site, $d_j^{S_k}$ is that site's number of outgoing cable connections, and α is the damping coefficient;
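The centrality computation above can be sketched in Python. This is a minimal illustration assuming the standard PageRank update with a damping factor over the optical-cable adjacency; the function name and graph encoding are illustrative, not from the patent.

```python
def site_centrality(adj, alpha=0.85, iters=500):
    """PageRank-style centrality over the optical-cable graph.

    adj: dict mapping each site to the list of sites it connects to.
    Returns a dict of centrality scores, initialised uniformly at 1/n.
    """
    n = len(adj)
    c = {m: 1.0 / n for m in adj}
    for _ in range(iters):
        new = {}
        for m in adj:
            # sum contributions from sites linking to m, each divided by
            # that site's out-degree (its number of cable connections)
            s = sum(c[j] / len(adj[j]) for j in adj if m in adj[j])
            new[m] = (1 - alpha) / n + alpha * s
        c = new
    return c
```

With no dangling sites, the scores stay normalised to 1, and a hub connected to many sites ends up with the largest centrality.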
The source data set is constructed from the site attributes of each province with larger data volume:

$$D = \{D_{S_1}, D_{S_2}, \ldots, D_{S_N}\}$$

where N is the number of provinces with larger data volume and $D_{S_k}$ is the k-th source data set, i.e. province $S_k$, containing $n_{S_k}$ samples, i.e. $n_{S_k}$ sites:

$$D_{S_k} = \{x_1^{S_k}, x_2^{S_k}, \ldots, x_{n_{S_k}}^{S_k}\}$$

where $n_{S_k}$ is the number of sites, i.e. samples, in province $S_k$, $S_N$ is the number of provinces, and $x_m^{S_k}$ is the site attribute of site m in province $S_k$, whose components are the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, the system the site belongs to, and the site centrality;
The target training set is constructed from the site attributes of the predicted province $S_T$:

$$D_T = \{x_1^{S_T}, x_2^{S_T}, \ldots, x_{n_T}^{S_T}\}$$

where $n_T$, the number of samples in the target training set, is the number of sites in the predicted province $S_T$, and $x_i^{S_T}$ ($i \in [1, n_T]$) is the site attribute, i.e. feature vector, of site i, whose components are the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, the system the site belongs to, and the site centrality of site i in the predicted province $S_T$;
The source data set D and the target training set $D_T$ are respectively discretized and normalized to obtain the normalized source data set $\hat D$ and the normalized target training set $\hat D_T$.

The site service quantities corresponding to each province $S_k$ in the source data set D are counted to obtain the service-quantity data set:

$$Y = \{Y_{S_1}, Y_{S_2}, \ldots, Y_{S_N}\},\qquad Y_{S_k} = \{y_1^{S_k}, \ldots, y_{n_{S_k}}^{S_k}\}$$

where $S_k \in [1, S_N]$ and $n_{S_k}$ is the number of sites in province $S_k$.

The site service quantities corresponding to the target training set $D_T$, i.e. province $S_T$, are counted to obtain the target service-quantity training set:

$$Y_T = \{y_1^{S_T}, \ldots, y_{n_T}^{S_T}\}$$

where $n_T$ is the number of sites in province $S_T$.

The service-quantity data set Y and the target service-quantity training set $Y_T$ are normalized with min-max standardization:

$$\hat y = \frac{y - \min(Y)}{\max(Y) - \min(Y)}$$

where min takes the set minimum, max takes the set maximum, and y is the service quantity of any province's site in Y or $Y_T$. After min-max standardization, the normalized service-quantity data set $\hat Y$ and the normalized target service-quantity training set $\hat Y_T$ are obtained.
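The min-max normalization and its inverse (the denormalization performed in step 6) can be written compactly. A minimal sketch with illustrative function names:

```python
def minmax_normalize(values):
    """Min-max normalisation used for both attributes and service counts.

    Returns the scaled values plus (min, max) so the transform can be
    inverted later. Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def minmax_denormalize(norm_values, lo, hi):
    """Inverse transform (the 'renormalization' of step 6)."""
    return [v * (hi - lo) + lo for v in norm_values]
```

Keeping (min, max) alongside the normalized set is what makes the final predicted service quantities recoverable on the original scale.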
Preferably, for the weight-based SVR model of step 2, from step 1 the normalized source data set is:

$$\hat D = \{\hat D_{S_1}, \ldots, \hat D_{S_N}\}$$

The k-th normalized source data set, i.e. province $S_k$, contains $n_{S_k}$ samples, i.e. $n_{S_k}$ sites:

$$\hat D_{S_k} = \{\hat x_1^{S_k}, \ldots, \hat x_{n_{S_k}}^{S_k}\}$$

The training data set is constructed from the normalized source data set as:

$$T_{S_k} = \{(\hat x_1^{S_k}, \hat y_1^{S_k}), \ldots, (\hat x_{n_{S_k}}^{S_k}, \hat y_{n_{S_k}}^{S_k})\}$$

where $S_N$ is the number of provinces, i.e. the number of source data sets, $n_{S_k}$ is the number of sites in province $S_k$, i.e. the size of the training data set $T_{S_k}$, $\hat y_i^{S_k}$ is the normalized service quantity of site i of province $S_k$ in $T_{S_k}$, and $\hat x_i^{S_k}$ is the normalized site attribute, i.e. normalized feature vector, of site i, whose components are the normalized site type, voltage class, dispatch grade, construction year, optical transmission equipment quantity, owning system, and centrality.

Each sample in the k-th normalized source data set $\hat D_{S_k}$, i.e. each site's normalized attribute, carries a weight $w_i^{S_k}$. The weight-based w-SVR model is:

$$f(\hat x) = q \cdot \varphi(\hat x) + b$$

where q is the weight parameter of the model and b is the bias parameter;
The parameter solution process of the weight-based w-SVR model is as follows.

The linear ε-insensitive loss function is defined as:

$$L_\varepsilon\!\left(\hat y_i^{S_k}, f(\hat x_i^{S_k})\right) = \max\!\left(0,\ \left|\hat y_i^{S_k} - f(\hat x_i^{S_k})\right| - \varepsilon\right)$$

where ε is the insensitive loss value: when the difference between the normalized service quantity $\hat y_i^{S_k}$ of site i of province $S_k$ and the predicted value $f(\hat x_i^{S_k})$ of the regression estimate function is less than ε, the loss equals 0.

The present invention selects the radial basis kernel function to nonlinearly transform the training data set into another feature space, constructs the regression estimate function in the feature space after the radial basis kernel transformation, and initializes the weights in the k-th normalized source data set $\hat D_{S_k}$ as $w_i^{S_k} = 1/n_{S_k}$. The radial basis kernel function is:

$$K(\hat x_i, \hat x_j) = \exp\!\left(-\frac{\left\|\hat x_i - \hat x_j\right\|^2}{2\sigma^2}\right)$$

where $\sigma^2$ is the variance of the training data set $T_{S_k}$;
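The radial basis kernel above is the standard Gaussian kernel; a one-line sketch (function name illustrative):

```python
import math

def rbf_kernel(x, y, sigma2):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).

    sigma2 is the training-set variance, as in the text.
    """
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma2))
```

The kernel equals 1 for identical inputs and decays toward 0 as the normalized attribute vectors move apart.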
Weight coefficients are introduced into the SVR model to control the influence of singular variance; the optimization objective is:

$$\min_{q,\,b,\,\xi,\,\xi'} \ \frac{1}{2}\|q\|^2 + C\sum_{i=1}^{n_{S_k}} w_i^{S_k}\left(\xi_i + \xi_i'\right)$$

subject to

$$\hat y_i^{S_k} - q \cdot \varphi(\hat x_i^{S_k}) - b \le \varepsilon + \xi_i,\qquad q \cdot \varphi(\hat x_i^{S_k}) + b - \hat y_i^{S_k} \le \varepsilon + \xi_i',\qquad \xi_i, \xi_i' \ge 0$$

where $\xi_i$ is the first slack variable, $\xi_i'$ is the second slack variable, ε is the insensitive loss value, C is the model parameter, q is the weight parameter of the model, and b is the bias parameter. By the Lagrange transformation and duality, the optimization problem is converted to:

$$\max_{\alpha,\,\alpha'} \ -\frac{1}{2}\sum_{i,j}\left(\alpha_i-\alpha_i'\right)\left(\alpha_j-\alpha_j'\right)K(\hat x_i,\hat x_j) - \varepsilon\sum_i\left(\alpha_i+\alpha_i'\right) + \sum_i \hat y_i^{S_k}\left(\alpha_i-\alpha_i'\right)$$

where $\alpha_i$ is the first Lagrange multiplier and $\alpha_i'$ is the second Lagrange multiplier. Solving for $\alpha_i, \alpha_i'$ must satisfy the KKT conditions, which give:

$$\sum_i\left(\alpha_i - \alpha_i'\right) = 0,\qquad 0 \le \alpha_i, \alpha_i' \le C\, w_i^{S_k}$$

The model weight parameter q and bias parameter b are found as:

$$q = \sum_i \left(\alpha_i - \alpha_i'\right)\varphi(\hat x_i),\qquad b = \hat y_j^{S_k} - \sum_i \left(\alpha_i-\alpha_i'\right)K(\hat x_i, \hat x_j) - \varepsilon \quad \left(\text{for some } 0 < \alpha_j < C\, w_j^{S_k}\right)$$

finally yielding the regression prediction model:

$$f(\hat x) = \sum_{i=1}^{n_{S_k}} \left(\alpha_i - \alpha_i'\right)K(\hat x_i, \hat x) + b$$
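The w-SVR above scales each sample's slack penalty by its weight $w_i^{S_k}$. As a compact runnable analogue (explicitly not the patent's exact model), the sketch below uses weighted kernel ridge regression with the same RBF kernel: a squared loss stands in for the ε-insensitive loss, and the per-sample weights scale each sample's error term. All names are illustrative.

```python
import numpy as np

def fit_weighted_krr(X, y, w, sigma2=1.0, lam=1e-3):
    """Weighted kernel ridge regression with an RBF kernel.

    Stand-in for the weight-based SVR: sample weights w scale each
    sample's loss, and the RBF kernel plays the same role as in w-SVR.
    Returns a predict(Xq) closure.
    """
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    w = np.asarray(w, float)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma2))
    W = np.diag(w)
    # minimise sum_i w_i * (y_i - (K a)_i)^2 + lam * a' K a
    # stationarity gives (W K + lam I) a = W y
    alpha = np.linalg.solve(W @ K + lam * np.eye(len(X)), W @ y)

    def predict(Xq):
        Xq = np.asarray(Xq, float)
        d2q = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2q / (2.0 * sigma2)) @ alpha

    return predict
```

A dual SVR solver with box constraints $0 \le \alpha_i, \alpha_i' \le C w_i$ would reproduce the text's model exactly; the weighted ridge variant keeps the example short while preserving the kernel and the per-sample weighting.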
Preferably, in step 3 the site weights of each province $S_k$ in the source data D and of the target province are initialized as:

$$w_i^{S_k} = \frac{1}{n_{S_k}},\qquad w_i^{S_T} = \frac{1}{n_T}$$

where $n_T$ is the number of sites in the target province.

The site weights of province $S_k$ are normalized to obtain the normalized weight vector of the source data province $S_k$; the site weights of the source data in the weighted multi-source TrAdaBoost algorithm are:

$$\hat w^{S_k} = \left(\hat w_1^{S_k}, \ldots, \hat w_{n_{S_k}}^{S_k}\right)$$

The site weights of the target province $S_T$ are normalized to obtain the normalized weight vector of the target province; the site weights of the target province in the weighted multi-source TrAdaBoost algorithm are:

$$\hat w^{S_T} = \left(\hat w_1^{S_T}, \ldots, \hat w_{n_T}^{S_T}\right)$$

The merged training data sets are:

$$D_k = \hat D_{S_k} \cup \hat D_T,\qquad Y_k = \hat Y_{S_k} \cup \hat Y_T$$

where $\hat D_{S_k}$ is the normalized source data of province $S_k$ in the normalized source data set $\hat D$ of step 1, N is the number of provinces, and each element of $\hat D_{S_k}$ is a site's normalized attribute, $n_{S_k}$ being the number of sites in province $S_k$; $\hat Y_{S_k}$ is the normalized service quantity of province $S_k$ in the normalized service-quantity data set $\hat Y$ of step 1, each element being a site's normalized service quantity; $\hat D_T$ is the normalized target training set of step 1, each element being a normalized attribute of the target province $S_T$, with $n_T$ sites in the target province; and $\hat Y_T$ is the normalized target service-quantity training set of step 1, each element being the normalized service quantity of a target-province site.
Preferably, in step 4 the merged training data sets $D_k, Y_k$ and the site weights $\hat w^{S_k}$ of the source data and $\hat w^{S_T}$ of the target province in the weighted multi-source TrAdaBoost algorithm are used to build, through step 2, the set of weight-based SVR models:

$$\left\{h_t^{S_1}, h_t^{S_2}, \ldots, h_t^{S_N}\right\}$$

where $h_t^{S_k}$ is the k-th weight-based SVR model, for province $S_k$, in the t-th iteration; N is the number of source data sets, i.e. the number of provinces; and each $h_t^{S_k}$ is determined by the first and second Lagrange multipliers $\alpha_{t,i}^{S_k}, \alpha_{t,i}'^{S_k}$, the bias parameter $b_t^{S_k}$, and the radial basis kernel $K(\hat x_i^{S_k}, \cdot)$ of site i of province $S_k$ in the t-th iteration.

The error of each prediction model $h_t^{S_k}$ on the normalized target training set $\hat D_T$ and the normalized target service-quantity training set $\hat Y_T$ in the t-th iteration is computed:

$$e_t^{S_k} = \sum_{i=1}^{n_T} \hat w_{t,i}^{S_T} \left| h_t^{S_k}(\hat x_i^{S_T}) - \hat y_i^{S_T} \right|$$

where $\hat w_{t,i}^{S_T}$ is the normalized weight of site i of the target province $S_T$ in the t-th iteration, $h_t^{S_k}(\hat x_i^{S_T})$ is the predicted service quantity of site i of the target province, and $\hat y_i^{S_T}$ is its actual service quantity. The weight of each prediction model $h_t^{S_k}$ is updated according to its error $e_t^{S_k}$.

Finally, the candidate prediction model $h_t$ of the t-th iteration is obtained by combining the per-province models according to these model weights.

Meanwhile, the error of the candidate prediction model $h_t$ on the target data $D_T, Y_T$ is computed, $w_{t,i}$ being the weight of each target-province data site.

The parameter $\phi_t$ for updating the sample weights is set as:

$$\phi_t = \frac{\varepsilon_t}{1 - \varepsilon_t}$$

where $\varepsilon_t$ is the error of the model obtained in the t-th iteration, used to update the weights of the target data samples:

$$\varepsilon_t = \frac{\sum_{i=1}^{n_T} \hat w_{t,i}^{S_T} \left| h_t(\hat x_i^{S_T}) - \hat y_i^{S_T} \right|}{\sum_{i=1}^{n_T} \hat w_{t,i}^{S_T}},\qquad \hat w_{t+1,i}^{S_T} = \hat w_{t,i}^{S_T}\, \phi_t^{-\mathbb{1}\left[\left|h_t(\hat x_i^{S_T}) - \hat y_i^{S_T}\right| > \varepsilon\right]}$$

where $\hat w_{t,i}^{S_T}$ is the weight of site i of the target province $S_T$ in the t-th iteration, $h_t(\hat x_i^{S_T})$ is the obtained predicted service quantity of site i of the target province, $\hat y_i^{S_T}$ is its service quantity, i.e. actual value, ε is the insensitive loss value, and $n_T$ is the number of sites in the target province.

The weights of the source data samples of each region are updated:

$$\hat w_{t+1,i}^{S_k} = \hat w_{t,i}^{S_k}\, \beta^{\mathbb{1}\left[\left|h_t(\hat x_i^{S_k}) - y_i^{S_k}\right| > \varepsilon\right]}$$

where $\hat w_{t,i}^{S_k}$ is the weight of site i of source data province $S_k$ in the t-th iteration, $h_t(\hat x_i^{S_k})$ is the predicted site service quantity of the t-th iteration, $y_i^{S_k}$ is the actual site service quantity, ε is the insensitive loss value, $n_{S_k}$ is the number of sites in the province, and the parameter β is:

$$\beta = \frac{1}{1 + \sqrt{2\ln\!\left(\sum_{k=1}^{N} n_{S_k}\right)\Big/ M}}$$

where M is the maximum number of iterations, t is the current iteration number, $t \in [1, M]$, and, per the source data D of step 1, $\sum_{k=1}^{N} n_{S_k}$ is the total number of sites over all source provinces;
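The two weight updates above can be sketched as a single function. Since the update exponents appear only as images in the original, this follows the classical TrAdaBoost formulation of Dai et al. (badly predicted source samples down-weighted by β, badly predicted target samples up-weighted via $\phi_t$); the function and variable names are illustrative.

```python
import math

def update_weights(src_w, src_err, tgt_w, tgt_err, M, n_src_total):
    """One TrAdaBoost-style weight update (Dai et al. form).

    src_w / tgt_w : current sample weights for source and target sites
    src_err / tgt_err : per-sample errors already thresholded to {0, 1}
        by the epsilon-insensitive rule |prediction - actual| > epsilon
    M : maximum iterations; n_src_total : total number of source samples
    """
    # source samples the current model predicts badly are down-weighted
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_src_total) / M))
    src_w = [w * beta ** e for w, e in zip(src_w, src_err)]
    # target samples the model gets wrong are up-weighted (AdaBoost-style)
    eps_t = sum(w * e for w, e in zip(tgt_w, tgt_err)) / sum(tgt_w)
    eps_t = min(max(eps_t, 1e-12), 0.499)   # keep phi in (0, 1)
    phi = eps_t / (1.0 - eps_t)
    tgt_w = [w * phi ** (-e) for w, e in zip(tgt_w, tgt_err)]
    return src_w, tgt_w, phi
```

Across iterations this shifts mass away from source sites that mislead the target prediction while concentrating target mass on the hard sites.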
Preferably, in step 5 step 4 is repeated up to the maximum number of iterations M, and when t = M the final prediction model f(x) is computed:

$$f(x) = \frac{\sum_{t=1}^{M} \ln\!\left(1/\phi_t\right) h_t(x)}{\sum_{t=1}^{M} \ln\!\left(1/\phi_t\right)}$$

where $\phi_t$ is the parameter value generated in each iteration and $h_t(x)$ is the model generated in each iteration;
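The combination of step 5 can be sketched as a φ-weighted model average. The patent's exact combination formula survives only as an image, so the weights $\ln(1/\phi_t)$, standard in AdaBoost-style ensembles, are an assumption here; names are illustrative.

```python
import math

def final_prediction(models, phis, x):
    """Combine per-iteration models h_t using weights ln(1/phi_t).

    models : list of callables h_t(x); phis : list of phi_t in (0, 1).
    Smaller phi_t (lower iteration error) means larger weight.
    """
    ws = [math.log(1.0 / p) for p in phis]
    total = sum(ws)
    return sum(w * h(x) for w, h in zip(ws, models)) / total
```

Two equally weighted constant models illustrate the averaging: equal $\phi_t$ values yield the plain mean of the model outputs.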
Preferably, in step 6, for the site attribute, i.e. feature vector, $\hat x_i^{S_T}$ of site i of the target province $S_T$, the model prediction is $\hat y_i = f(\hat x_i^{S_T})$, and the predicted value is denormalized:

$$y_i = \hat y_i\left(\max(Y) - \min(Y)\right) + \min(Y)$$

where min takes the set minimum and max takes the set maximum.
Compared with the prior art, the present invention saves data resources and improves data quality.
Description of the drawings
Fig. 1: Flow chart of the method of the present invention.
Specific implementation mode
To help those of ordinary skill in the art understand and implement the present invention, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the implementation examples described here are only used to illustrate and explain the present invention and are not intended to limit it.
The specific steps of the embodiment of the present invention are introduced with reference to Fig. 1. The present invention provides a data verification method based on multi-source transfer learning, whose specific steps are:
Step 1: Obtain from the system data tables the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, and the system the site belongs to, together with the site centrality computed by the PageRank algorithm, to construct the site attributes; further construct a source data set from the site attributes of each site in each province and normalize it; construct a target training set from the site attributes of the province to be predicted and normalize it; and extract the site service quantities corresponding to the source data set and the target training set and normalize them;
In step 1 the site attribute, i.e. the feature vector, of a site is:

$$x_m^{S_k} = \left(x_{m,1}^{S_k}, x_{m,2}^{S_k}, x_{m,3}^{S_k}, x_{m,4}^{S_k}, x_{m,5}^{S_k}, x_{m,6}^{S_k}, x_{m,7}^{S_k}\right)$$

where $x_m^{S_k}$ is the site attribute of site m in province $S_k$, N = 10 is the number of provinces, $n_{S_k}$ is the number of sites in province $S_k$, and the seven components are the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, the system the site belongs to, and the site centrality of site m in province $S_k$.

The site type, voltage class, dispatch grade, construction year, optical transmission equipment quantity, and owning system can be obtained from the system data tables. The site centrality of site m in province $S_k$ is first initialized according to the site's degree and the number of sites:

$$ctr_m^{S_k}(0) = \frac{d_m^{S_k}}{n_{S_k}}$$

where $ctr_m^{S_k}$ is the centrality of site m in province $S_k$, $n_{S_k}$ is the number of sites, and $d_m^{S_k}$ is the degree of site m. The centrality is then updated by PageRank iterations with the following formula until it stabilizes:

$$ctr_m^{S_k}(iter+1) = \frac{1-\alpha}{n_{S_k}} + \alpha \sum_{j \in In(m)} \frac{ctr_j^{S_k}(iter)}{d_j^{S_k}}$$

where iter is the PageRank iteration number, NI = 500 is the total number of PageRank iterations, In(m) is the set of sites in province $S_k$ connected to site m by optical cable, $ctr_j^{S_k}$ is the centrality of the j-th connected site, $d_j^{S_k}$ is that site's number of outgoing cable connections, and α = 0.85 is the damping coefficient;
The source data set is constructed from the site attributes of each province with larger data volume:

$$D = \{D_{S_1}, D_{S_2}, \ldots, D_{S_N}\}$$

where N = 10 is the number of provinces with larger data volume and $D_{S_k}$ is the k-th source data set, i.e. province $S_k$, containing $n_{S_k}$ samples, i.e. $n_{S_k}$ sites:

$$D_{S_k} = \{x_1^{S_k}, x_2^{S_k}, \ldots, x_{n_{S_k}}^{S_k}\}$$

where $n_{S_k}$ is the number of sites, i.e. samples, in province $S_k$, $S_N$ is the number of provinces, and $x_m^{S_k}$ is the site attribute of site m in province $S_k$, whose components are the site type, site voltage class, site dispatch grade, site construction year, quantity of optical transmission equipment in the site, the system the site belongs to, and the site centrality.

The target training set is constructed from the site attributes of the predicted province $S_T$:

$$D_T = \{x_1^{S_T}, x_2^{S_T}, \ldots, x_{n_T}^{S_T}\}$$

where $n_T$, the number of samples in the target training set, is the number of sites in the predicted province $S_T$, and $x_i^{S_T}$ ($i \in [1, n_T]$) is the site attribute, i.e. feature vector, of site i, with the same seven components for the predicted province $S_T$;
The source data set D and the target training set $D_T$ are respectively discretized and normalized to obtain the normalized source data set $\hat D$ and the normalized target training set $\hat D_T$.

The site service quantities corresponding to each province $S_k$ in the source data set D are counted to obtain the service-quantity data set:

$$Y = \{Y_{S_1}, Y_{S_2}, \ldots, Y_{S_N}\},\qquad Y_{S_k} = \{y_1^{S_k}, \ldots, y_{n_{S_k}}^{S_k}\}$$

where $S_k \in [1, S_N]$ and $n_{S_k}$ is the number of sites in province $S_k$.

The site service quantities corresponding to the target training set $D_T$, i.e. province $S_T$, are counted to obtain the target service-quantity training set:

$$Y_T = \{y_1^{S_T}, \ldots, y_{n_T}^{S_T}\}$$

where $n_T$ is the number of sites in province $S_T$.

The service-quantity data set Y and the target service-quantity training set $Y_T$ are normalized with min-max standardization:

$$\hat y = \frac{y - \min(Y)}{\max(Y) - \min(Y)}$$

where min takes the set minimum, max takes the set maximum, and y is the service quantity of any province's site in Y or $Y_T$. After min-max standardization, the normalized service-quantity data set $\hat Y$ and the normalized target service-quantity training set $\hat Y_T$ are obtained.
Step 2: Build a weight-based SVR model from the transfer-learning SVR model and a radial basis function;
SVR models based on weight described in step 2 are by understanding that normalizing set of source data is described in step 1:
SkA normalization source data, that is, province SkIncludingA sample isA website:
It is according to normalization set of source data structure training dataset:
Wherein, SNFor the quantity of the quantity, that is, sample in province,For province SkQuantity, that is, training dataset of websiteSize,For training datasetMiddle province SkThe normalization number of services of website i,For training number According to collectionMiddle province SkNormalization website attribute, that is, normalization characteristic vector of website i is:
where the components of the normalized feature vector of site m of province S_k are, in order: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality;
Each sample (site) in the S_k-th normalized source data is assigned a normalization attribute weight w_i. The weight-based w-SVR model is then:
where q is the weight parameter of the model and b is its bias parameter;
The parameters of the weight-based w-SVR model are solved as follows.
Define the linear ε-insensitive loss function:
where ε = 1/e is the insensitive loss value: when the difference between the normalized business quantity of site i of province S_k and the prediction of the regression estimate function is smaller than ε, the loss equals 0;
The invention selects the radial basis kernel K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)), which nonlinearly maps the training data set into another feature space; the regression estimate function is constructed in the feature space obtained by this transformation, and the weights of the S_k-th normalized source data are initialized. In the kernel formula,
σ² is the variance of the training data set;
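The radial basis kernel can be written out directly. This NumPy sketch (variable names ours) uses the training-set variance as σ², as the text specifies:

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma2):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 * sigma2))."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.exp(-diff.dot(diff) / (2.0 * sigma2)))

X = np.array([[0.2, 0.5, 0.1],
              [0.8, 0.4, 0.9]])            # two normalized site feature vectors
sigma2 = X.var()                           # sigma^2 = variance of the training data
k_same = rbf_kernel(X[0], X[0], sigma2)    # identical inputs: K = 1
k_diff = rbf_kernel(X[0], X[1], sigma2)    # distinct inputs: 0 < K < 1
```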
A weight coefficient is introduced into the SVR model to control the influence of singular variance, giving the optimization objective min ½‖q‖² + C Σ_i w_i (ξ_i + ξ'_i), subject to the ε-insensitive constraints y_i − q·φ(x_i) − b ≤ ε + ξ_i, q·φ(x_i) + b − y_i ≤ ε + ξ'_i, and ξ_i, ξ'_i ≥ 0,
where ξ_i is the first slack variable, ξ'_i the second slack variable, ε = 1/e the insensitive loss value, C a model parameter, q the weight parameter of the model, and b its bias parameter. Applying the Lagrangian transformation and passing to the dual, the optimization problem becomes:
where α_i is the first Lagrange multiplier and α'_i the second Lagrange multiplier; the solution for α_i and α'_i must also satisfy the KKT conditions, hence:
The model weight parameter q and bias parameter b are then recovered:
where the multipliers of the support vectors satisfy the box constraints 0 < α_i, α'_i < C·w_i; the regression prediction model f(x) = Σ_i (α_i − α'_i) K(x_i, x) + b is finally obtained.
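The effect of the per-site weights on the SVR objective can be approximated off the shelf: scikit-learn's SVR accepts per-sample weights that scale each sample's slack penalty, mirroring the C·w_i box constraints above. This is a stand-in sketch on synthetic data, not the patent's own w-SVR solver:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((40, 7))              # 40 sites x 7 normalized site attributes
y = X.sum(axis=1) / 7.0              # synthetic normalized business quantities
w = np.full(40, 1.0 / 40)            # uniform initial site weights (Step 3)

model = SVR(kernel="rbf", C=10.0, epsilon=1.0 / np.e)  # epsilon = 1/e, as in the text
model.fit(X, y, sample_weight=w)     # w_i scales sample i's slack penalty
pred = model.predict(X)
```

As the TrAdaBoost iterations reweight the sites, refitting with the updated `sample_weight` reproduces the "weight-based SVR" behaviour.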
Step 3: initialize the weight of each site in the source data and in the target province, normalize these initial weights, initialize the site weights used by the weighted multi-source TrAdaBoost algorithm, and build the merged training set by concatenating the normalized source data set with the normalized target training set and the corresponding normalized business-quantity sets;
The site weights of each province S_k in the source data D (Step 3), and of the target province, are initialized as an equal share per site:
where the normalizing count for the target province is its site count n_T;
Normalizing the site weights of province S_k yields the normalized weight vector of source-data province S_k; these are the source-data site weights used in the weighted multi-source TrAdaBoost algorithm:
Likewise, normalizing the site weights of the target province S_T yields the target province's normalized weight vector; these are the target-province site weights used in the weighted multi-source TrAdaBoost algorithm:
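The initialization and normalization of the site weights can be sketched as follows (the equal 1/n split per province is our reading of the formulas, which are images in the original):

```python
import numpy as np

def init_site_weights(source_site_counts, n_target):
    """Give every site of a province an equal initial weight, then normalize so
    each province's weight vector (and the target's) sums to 1, as required by
    the weighted multi-source TrAdaBoost algorithm."""
    source = [np.full(n_k, 1.0 / n_k) for n_k in source_site_counts]
    target = np.full(n_target, 1.0 / n_target)
    return source, target

src_w, tgt_w = init_site_weights([30, 45, 25], n_target=20)  # hypothetical counts
```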
The merged training data set is then formed:
where the first component is the normalized source data of province S_k from the normalized source data set of Step 1, with N the number of provinces, each of its elements being the normalized attribute vector of one site of province S_k;
where the second component is the normalized business-quantity data of province S_k from the normalized business-quantity data set of Step 1 (N = 10 provinces in this embodiment), each element being the normalized business quantity of one site of province S_k;
where the third component is the normalized target training set of Step 1, each element being the normalized attribute vector of one site of the target province S_T;
and where the fourth component is the normalized target business-quantity training set of Step 1, each element being the normalized business quantity of one site of the target province.
Step 4: from the merged training set and the normalized weight vectors, build the prediction model via Step 2 and compute the model error parameters;
Specifically, the merged training data sets D_k, Y_k, the source-data site weights of the weighted multi-source TrAdaBoost algorithm, and the target-province site weights are passed through Step 2 to build the set of weight-based SVR models:
where the k-th model is the weight-based SVR model of province S_k at iteration t, N is the number of source data sets (provinces), and the remaining symbols are, for site i of province S_k at iteration t: the first Lagrange multiplier, the second Lagrange multiplier, the bias parameter, and the radial basis kernel;
The error of the prediction model on the normalized target training set and the normalized target business-quantity training set at iteration t is computed:
where, at iteration t, each site i of the target province S_T enters with its normalized weight, its predicted business quantity, and its actual business quantity; according to this error the prediction-model weights are updated:
Finally, the candidate prediction model h_t of iteration t is obtained:
The parameter φ_t used to update the sample weights is set:
where ε_t is the error of the model obtained at iteration t; the weights of the target data samples are then updated:
where the update uses, for each site i of the target province S_T at iteration t, its weight, its predicted business quantity, and its actual business quantity; ε = 1/e is the insensitive loss value and n_T is the site count of the target province;
The weight of each source data sample of every region is updated:
where the update uses, for each site i of source province S_k at iteration t, its weight, its predicted site business quantity, and its actual site business quantity; ε = 1/e is the insensitive loss value and the site count of the province enters the normalization. The parameter β is:
where M = 200 is the maximum number of iterations, t ∈ [1, M] is the current iteration, and, following the source data of Step 1, n is the total number of sites summed over all provinces;
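The weight-update formulas of Step 4 are images in the original; the sketch below follows the standard TrAdaBoost scheme (target samples with large error are boosted, source samples with large error are damped), which is consistent with the surrounding description but is an assumption in its exact form:

```python
import numpy as np

def tradaboost_update(w_src, w_tgt, err_src, err_tgt, eps_t, M, n_src_total):
    """One TrAdaBoost-style weight update. phi_t raises hard target samples;
    beta lowers mismatched source samples; weights are renormalized jointly."""
    phi_t = eps_t / (1.0 - eps_t)                      # target-weight factor
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src_total) / M))
    w_tgt = w_tgt * phi_t ** (-np.abs(err_tgt))        # boost hard target samples
    w_src = w_src * beta ** np.abs(err_src)            # damp mismatched source samples
    total = w_src.sum() + w_tgt.sum()
    return w_src / total, w_tgt / total

rng = np.random.default_rng(1)
w_src = np.full(100, 1.0 / 120)                        # hypothetical source sites
w_tgt = np.full(20, 1.0 / 120)                         # hypothetical target sites
w_src, w_tgt = tradaboost_update(w_src, w_tgt, rng.random(100), rng.random(20),
                                 eps_t=0.2, M=200, n_src_total=100)
```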
Step 5: repeat Step 4 up to the maximum number of iterations and compute the final prediction model;
Concretely, Step 4 is repeated until the maximum number of iterations is reached:
if t = M (M = 200), the final prediction model f(x) is computed:
where φ_t is the parameter value produced in each iteration and h_t(x) is the model produced in each iteration;
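The combination formula for f(x) is an image in the original; TrAdaBoost-style regressors conventionally combine the per-iteration models h_t by the φ_t-weighted median over the later iterations, so this is a hedged sketch of that convention, not the patent's confirmed rule:

```python
import numpy as np

def final_prediction(models, phis, x, use_last_half=True):
    """Combine per-iteration models h_t by the phi_t-weighted median (the usual
    TrAdaBoost.R2 choice). Models with smaller phi_t, i.e. smaller error,
    get a larger say in the median."""
    start = len(models) // 2 if use_last_half else 0
    preds = np.array([h(x) for h in models[start:]])
    weights = np.log(1.0 / np.array(phis[start:]))   # smaller phi_t -> larger weight
    order = np.argsort(preds)                        # weighted median of preds
    cum = np.cumsum(weights[order])
    idx = order[np.searchsorted(cum, 0.5 * cum[-1])]
    return preds[idx]

models = [lambda x, c=c: c for c in (0.2, 0.4, 0.6, 0.8)]  # stand-in h_t's
phis = [0.5, 0.4, 0.3, 0.2]                                # stand-in phi_t's
y_hat = final_prediction(models, phis, x=None)
```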
Step 6: predict on the site attributes of the target province with the final prediction model to obtain the predicted site business quantities of the target province, and denormalize the predicted quantities.
For the site attribute (feature) vector of site i of the target province S_T (Step 6),
the model prediction is denormalized by inverting the min-max transform, y = y'·(max − min) + min,
where min and max are the minimum and maximum of the original set.
It should be understood that the above description of the preferred embodiment is relatively detailed and should therefore not be taken as limiting the scope of patent protection. Under the teaching of the present invention, and without departing from the scope protected by the claims, those skilled in the art may make substitutions or variations, all of which fall within the protection scope of the present invention; the claimed scope of the invention is determined by the appended claims.

Claims (7)

1. A data verification method based on multi-source transfer learning, characterized by comprising the following steps:
Step 1: obtain from the system data tables the site type, site voltage class, site dispatch level, site construction age, number of optical transmission devices in the site, and the system the site belongs to, together with the site centrality computed by the PageRank algorithm, to form the site attributes; build the source data set from the site attributes of every site of each province and normalize it; build the target training set from the site attributes of the predicted province and normalize it; extract the site business quantities corresponding to the source data set and the target training set and normalize them;
Step 2: build the weight-based SVR model from the SVR transfer-learning model and a radial basis kernel function;
Step 3: initialize the weight of each site in the source data and in the target province, normalize these initial weights, initialize the site weights used by the weighted multi-source TrAdaBoost algorithm, and build the merged training set by concatenating the normalized source data set with the normalized target training set and the corresponding normalized business-quantity sets;
Step 4: from the merged training set and the normalized weight vectors, build the prediction model via Step 2 and compute the model error parameters;
Step 5: repeat Step 4 up to the maximum number of iterations and compute the final prediction model;
Step 6: predict on the site attributes of the target province with the final prediction model to obtain the predicted site business quantities of the target province, and denormalize the predicted site business quantities.
2. The data verification method based on multi-source transfer learning according to claim 1, characterized in that the site attribute, i.e. feature vector, of Step 1 is:
where the vector of site m of province S_k, with S_k ∈ [1, S_N] and S_N the number of provinces, comprises in order: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality;
the site type, site voltage class, site dispatch level, site construction age, number of optical transmission devices, and parent system can be read from the system data tables; the site centrality of site m of province S_k is first initialized from the site's degree and the province's site count:
where the initialization uses the site count of province S_k and the degree of site m; the centrality is then updated by PageRank iterations until it stabilizes, using the following formula:
where iter is the PageRank iteration index, NI = 500 is the total number of PageRank iterations, the sum runs over the set of sites of province S_k connected to site m by optical cable, each connected site j contributes its centrality divided by its number of outgoing cable connections, and α is the damping coefficient;
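The centrality iteration of claim 2 can be sketched as a PageRank pass over the optical-cable graph. Uniform initialization and α = 0.85 are our assumptions, since the claim's formulas are images:

```python
import numpy as np

def site_centrality(adj, alpha=0.85, n_iter=500):
    """PageRank-style site centrality. adj[m] lists the sites cabled to site m;
    each neighbour j spreads its centrality over its own cable count."""
    n = len(adj)
    c = np.full(n, 1.0 / n)                    # initial centrality per site
    for _ in range(n_iter):
        nxt = np.full(n, (1.0 - alpha) / n)    # damping term
        for m, neighbours in enumerate(adj):
            for j in neighbours:
                nxt[m] += alpha * c[j] / len(adj[j])
        c = nxt
    return c

adj = [[1], [0, 2], [1]]        # 3 sites on a cable path: the middle site is central
c = site_centrality(adj, n_iter=100)
```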
The source data set is built from the site attributes of each province with a large data volume:
where N is the number of provinces with a large data volume, and the S_k-th source data, i.e. province S_k, contains one sample per site:
where the site count of province S_k is its number of samples, and the site attribute vector of site m of province S_k, with S_k ∈ [1, S_N] and S_N the number of provinces, comprises in order: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality;
the target training set is built from the site attributes of the predicted province S_T:
where n_T, the number of samples of the target training set, is the site count of the predicted province S_T, and the site attribute, i.e. feature vector, of site i (i ∈ [1, n_T]) of the predicted province S_T is:
with components, in order: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality of site i;
the source data set D and the target training set are discretized and normalized separately, yielding the normalized source data set and the normalized target training set;
the site business quantities corresponding to each province S_k in the source data set D are counted to obtain the business-quantity data set:
where S_k ∈ [1, S_N] and the count runs over the sites of province S_k;
the site business quantities of the target training set, i.e. of province S_T, are counted to obtain the target business-quantity training set:
where the count runs over the sites of province S_T;
the business-quantity data set Y and the target business-quantity training set are normalized with min-max standardization, y' = (y − min) / (max − min):
where min and max are the minimum and maximum of the set and y is any site business quantity in Y or in the target business-quantity training set; after min-max standardization these yield the normalized business-quantity data set and the normalized target business-quantity training set, respectively.
3. The data verification method based on multi-source transfer learning according to claim 1, characterized in that the weight-based SVR model of Step 2 is built as follows. From Step 1, the normalized source data set is known:
the S_k-th normalized source data, i.e. province S_k, contains one sample per site:
a training data set is constructed from the normalized source data set:
where S_N is the number of provinces (i.e. of source data sets), the size of the training set for province S_k equals its site count, and the training set holds, for each site i of province S_k, its normalized business quantity and its normalized site attributes, i.e. its normalized feature vector:
with components, in order, for site m of province S_k: the site type, the site voltage class, the site dispatch level, the site construction age, the number of optical transmission devices in the site, the system the site belongs to, and the site centrality;
each sample (site) in the S_k-th normalized source data is assigned a normalization attribute weight; the weight-based w-SVR model is then:
where q is the weight parameter of the model and b is its bias parameter;
the parameters of the weight-based w-SVR model are solved as follows:
define the linear ε-insensitive loss function:
where ε is the insensitive loss value: when the difference between the normalized business quantity of site i of province S_k and the prediction of the regression estimate function is smaller than ε, the loss equals 0;
the invention selects the radial basis kernel K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)), which nonlinearly maps the training data set into another feature space; the regression estimate function is constructed in the feature space obtained by this transformation, and the weights of the S_k-th normalized source data are initialized; in the kernel formula:
σ² is the variance of the training data set;
a weight coefficient is introduced into the SVR model to control the influence of singular variance, giving the optimization objective:
where ξ_i is the first slack variable, ξ'_i the second slack variable, ε the insensitive loss value, C a model parameter, q the weight parameter of the model, and b its bias parameter; applying the Lagrangian transformation and passing to the dual, the optimization problem becomes:
where α_i is the first Lagrange multiplier and α'_i the second Lagrange multiplier; the solution for α_i and α'_i must also satisfy the KKT conditions, hence:
the model weight parameter q and bias parameter b are recovered:
where 0 < α_i within the box constraints; the regression prediction model is finally obtained:
4. The data verification method based on multi-source transfer learning according to claim 1, characterized in that:
the site weights of each province S_k in the source data D (Step 3) are initialized as:
where the denominator is the sample count, i.e. site count, of province S_k; the site weights of the target province S_T are initialized as:
where n_T is the site count of the target province;
normalizing the site weights of province S_k yields the normalized weight vector of source-data province S_k; these are the source-data site weights used in the weighted multi-source TrAdaBoost algorithm:
likewise, normalizing the site weights of the target province S_T yields the target province's normalized weight vector; these are the target-province site weights used in the weighted multi-source TrAdaBoost algorithm:
the merged training data set is then formed:
where the first component is the normalized source data of province S_k from the normalized source data set of Step 1, with N the number of provinces, each of its elements being the normalized attribute vector of one site of province S_k;
where the second component is the normalized business-quantity data of province S_k from the normalized business-quantity data set of Step 1, each element being the normalized business quantity of one site of province S_k;
where the third component is the normalized target training set of Step 1, each element being the normalized attribute vector of one site of the target province S_T;
and where the fourth component is the normalized target business-quantity training set of Step 1, each element being the normalized business quantity of one site of the target province.
5. The data verification method based on multi-source transfer learning according to claim 1, characterized in that in Step 4 the merged training data sets D_k, Y_k, the source-data site weights of the weighted multi-source TrAdaBoost algorithm, and the target-province site weights are passed through Step 2 to build the set of weight-based SVR models:
where the k-th model is the weight-based SVR model of province S_k at iteration t, N is the number of source data sets (provinces), and the remaining symbols are, for site i of province S_k at iteration t: the first Lagrange multiplier, the second Lagrange multiplier, the bias parameter, and the radial basis kernel;
the error of the prediction model on the normalized target training set and the normalized target business-quantity training set at iteration t is computed:
where, at iteration t, each site i of the target province S_T enters with its normalized weight, its predicted business quantity, and its actual business quantity; according to this error the prediction-model weights are updated:
finally, the candidate prediction model h_t of iteration t is obtained:
meanwhile, the error of the candidate prediction model h_t on the target detection data D_T, Y_T is computed, with w_{t,i} the weight of the target-province data sites:
the parameter φ_t used to update the sample weights is set:
where ε_t is the error of the model obtained at iteration t; the weights of the target data samples are then updated:
where the update uses, for each site i of the target province S_T at iteration t, its weight, its predicted business quantity, and its actual business quantity; ε is the insensitive loss value and the denominator is the site count of the target province;
the weight of each source data sample of every region is updated:
where the update uses, for each site i of source province S_k at iteration t, its weight, its predicted site business quantity, and its actual site business quantity; ε is the insensitive loss value and the site count of the province enters the normalization; the parameter β is:
where M is the maximum number of iterations, t ∈ [1, M] is the current iteration, and, following the source data of Step 1, n is the total number of sites summed over all provinces.
6. The data verification method based on multi-source transfer learning according to claim 1, characterized in that in Step 5, Step 4 is repeated up to the maximum number of iterations and the final prediction model is computed as follows:
if t = M, the final prediction model f(x) is computed:
where φ_t is the parameter value produced in each iteration and h_t(x) is the model produced in each iteration.
7. The data verification method based on multi-source transfer learning according to claim 1, characterized in that in Step 6, for the site attribute, i.e. feature vector, of site i of the target province S_T:
the model prediction is denormalized by inverting the min-max transform, y = y'·(max − min) + min:
where min and max are the minimum and maximum of the original set.
CN201810320808.6A 2018-04-11 2018-04-11 Data verification method based on multi-source transfer learning Active CN108549907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810320808.6A CN108549907B (en) 2018-04-11 2018-04-11 Data verification method based on multi-source transfer learning


Publications (2)

Publication Number Publication Date
CN108549907A true CN108549907A (en) 2018-09-18
CN108549907B CN108549907B (en) 2021-11-16

Family

ID=63514421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810320808.6A Active CN108549907B (en) 2018-04-11 2018-04-11 Data verification method based on multi-source transfer learning

Country Status (1)

Country Link
CN (1) CN108549907B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110398986A (en) * 2019-04-28 2019-11-01 清华大学 A kind of intensive woods cognition technology of unmanned plane of multi-source data migration
CN110457646A (en) * 2019-06-26 2019-11-15 中国政法大学 One kind being based on parameter transfer learning low-resource head-position difficult labor personalized method
CN110674648A (en) * 2019-09-29 2020-01-10 厦门大学 Neural network machine translation model based on iterative bidirectional migration
WO2020168676A1 (en) * 2019-02-21 2020-08-27 烽火通信科技股份有限公司 Method for constructing network fault handling model, fault handling method and system
CN112651173A (en) * 2020-12-18 2021-04-13 浙江大学 Agricultural product quality nondestructive testing method based on cross-domain spectral information and generalizable system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100069942A (en) * 2008-12-17 2010-06-25 한양대학교 산학협력단 Method for cooperative transmitting data in wireless multihop network and system thereof
CN104199857A (en) * 2014-08-14 2014-12-10 西安交通大学 Tax document hierarchical classification method based on multi-tag classification
CN106296044A (en) * 2016-10-08 2017-01-04 南方电网科学研究院有限责任公司 power system risk scheduling method and system
CN106651188A (en) * 2016-12-27 2017-05-10 贵州电网有限责任公司贵阳供电局 Electric transmission and transformation device multi-source state assessment data processing method and application thereof
CN107818523A (en) * 2017-11-14 2018-03-20 国网江西省电力公司信息通信分公司 Power communication system data true value based on unstable frequency distribution and frequency factor study differentiates and estimating method



Also Published As

Publication number Publication date
CN108549907B (en) 2021-11-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant