CN108549907A - A kind of data verification method based on multi-source transfer learning - Google Patents
A data verification method based on multi-source transfer learning
- Publication number: CN108549907A (application CN201810320808.6A)
- Authority
- CN
- China
- Prior art keywords
- website
- province
- target
- normalization
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/214 — Pattern recognition; analysing; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F17/16 — Complex mathematical operations; matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06F18/245 — Pattern recognition; classification techniques relating to the decision surface
Abstract
The present invention proposes a data verification method based on multi-source transfer learning. The method extracts the site service counts corresponding to the source data set and the target training set and normalizes them; builds a weight-based SVR model from the SVR transfer learning model and a radial basis kernel function; initializes and normalizes the site weights of the source data and of the target province, and obtains merged training sets by merging the normalized source data set with the normalized target training set and the normalized service-count training set with the normalized target service counts; builds a prediction model from the merged training sets and normalized weight vectors and computes the model error parameters; iterates repeatedly and computes the final prediction model; and uses the final prediction model to obtain the predicted site service counts of the target province, which are then de-normalized. Compared with the prior art, the present invention improves data quality and saves data resources.
Description
Technical field
The invention belongs to the field of transfer learning, and more particularly relates to a data verification method based on multi-source transfer learning.
Background technology
As the second physical network of the State Grid, the communications management system (TMS) of the electric power company carries the core business of power grid operation and management and is an important guarantee of safe, stable, and economical grid operation. As the core management and control system of the company's communication specialty, TMS has played a major role in resource management, real-time monitoring, and operation management, and has accumulated a large amount of data. TMS data are stored in databases, with each unit independently deploying its own database server. The data mainly comprise business data generated by TMS resource records, alarms, work orders, and the internal modules; data flowing between the superior and subordinate systems of the State Grid communication company, its branches, provincial companies, and municipalities, such as resource ownership, trouble-ticket dispatch, statistical reports, task dispatching, and alarm reports; and data exchanged with external systems at the same level, such as account data and workflows. However, data quality problems in TMS severely affect data analysis and decision making in actual production. They appear mainly in three respects: static resource data that do not match reality, incorrectly associated dynamic resource data, and basic data that are not kept up to date. These problems undermine the practical value of TMS in supporting lean management of power communications. At the same time, data volumes differ greatly across provinces. Provincial companies with smaller networks hold 1 GB to 2 GB of data, while large units such as the State Grid communication company hold 30 GB to 40 GB; for some particular services in remote regions the data may amount to only a few hundred kilobytes, which is far from enough to train a good conventional machine learning model.
Data quality problems such as missing, erroneous, and outdated data have always been an important topic in big-data analysis, and every year they cause huge losses to society. According to a survey by a German data analysis institution, "the United States loses up to 600 billion dollars every year because of bad data", and medical malpractice caused by data errors kills 98,000 patients in the United States every year. In TMS, business management is infrequent: management data consist mostly of monthly statements, and business progress and status are not managed daily (or at a higher frequency). Moreover, business process records are entered and maintained later than the processes themselves occur, so a large amount of data does not match actual production. This seriously affects the company's judgement and decisions about the business in actual production, so the data themselves must be examined before any analysis. The present invention predicts the service count of each site to judge whether service records are missing in the station system, and thereby finds abnormal sites. These data differ widely from province to province. For provinces with sufficient data, traditional machine learning methods such as support vector regression and neural networks achieve good results, but traditional machine learning requires the training and test data to follow the same distribution, so the data of different provinces cannot simply be pooled for training. Training therefore fails for regions with little data: forcing an analysis on a single small region yields a poor model for lack of data, while pooling all provinces degrades the model because the data sets are distributed differently. On this basis, the present invention proposes to use the data of other provinces to train on the target data with a transfer learning method, so as to detect abnormal sites.
Transfer learning is a new field of machine learning whose purpose is to use existing knowledge to train and learn in a different but related field. Transfer learning relaxes two basic assumptions of conventional machine learning: that training and test data are independent and identically distributed, and that enough data are available to train a good model. Studies show that the more similar two fields are, the easier transfer learning is and the better it works; otherwise the effect is often poor, and "negative transfer" may even occur. Domain adaptation is an active research direction in transfer learning. Pan et al. proposed the TCA (Transfer Component Analysis) algorithm for domain adaptation, a feature-based transfer learning method: when the source and target domains follow different data distributions, the data of both domains are mapped together into a high-dimensional reproducing kernel Hilbert space in which the distance between source and target data is minimized while their internal properties are preserved as far as possible. TCA only considers the correlation of the target and source domains in this other space, which is too limited, and its time complexity is relatively high. Dai et al. proposed the instance-based TrAdaBoost (Transfer AdaBoost) algorithm, whose idea is to make maximal use of the source data by finding the data related to the target data and training on them together with the target data. However, TrAdaBoost uses only a single source; its result depends on the correlation between source and target, the accuracy of the algorithm being proportional to that correlation, and a weak correlation easily produces negative transfer. By considering the correlation between multiple sources and the target, Yao et al. proposed two multi-source transfer learning algorithms, MTrA (MultiSource-TrAdaBoost) and TTrA (Task-TrAdaBoost). MTrA assumes multiple data sources and in each iteration trains a weak classifier on the source most strongly correlated with the target data, finally combining them into a strong classifier; TTrA trains one weak classifier per source in each iteration, selects the classifier with minimum error on the target data, and combines the selected classifiers into a strong classifier after all iterations. Both multi-source algorithms select the source most correlated with the target at every iteration. Although this guarantees that the migrated source data are most related to the target, the information of the other sources goes unused; in actual production each data source is very costly, so this wastes a large amount of the company's resources. The data quality problems in TMS have seriously affected the company's judgement and operation of the actual business, and the differences in distribution and volume of the data across regions also make discovering data quality problems challenging.
Summary of the invention
To solve the above problems, the present invention proposes a data verification method based on multi-source transfer learning. The technical solution adopted by the invention is:
Step 1: Obtain from the system data tables the site type, site voltage class, site dispatch grade, site construction year, number of optical transmission devices in the site, and the system the site belongs to, together with the site centrality computed by the PageRank algorithm, to form the site attributes; build a source data set from the site attributes of each site in each province and normalize it; build a target training set from the site attributes of the province to be predicted and normalize it; extract the site service counts corresponding to the source data set and the target training set and normalize them;
Step 2: Build a weight-based SVR model from the SVR transfer learning model and a radial basis kernel function;
Step 3: Initialize the weight of each site of the source data and of the target province, normalize these weights, initialize the site weights of the source data and the target province in the weighted multi-source TrAdaBoost algorithm, and obtain merged training sets by merging the normalized source data set with the normalized target training set and the normalized service-count training set with the normalized target service counts;
Step 4: From the merged training sets and the normalized weight vectors, build a prediction model by step 2 and compute the model error parameters;
Step 5: Repeat step 4 until the maximum number of iterations and compute the final prediction model;
Step 6: Predict on the site attributes of the target province with the final prediction model to obtain the predicted site service counts of the target province, and de-normalize the predicted counts.
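The six steps above can be sketched end to end. The following is a minimal illustrative outline, not the patented implementation: the data are synthetic, the function names are hypothetical, and a trivial mean predictor stands in for the weighted multi-source TrAdaBoost loop of steps 2-5.

```python
import numpy as np

rng = np.random.default_rng(0)

def min_max(a):
    # step 1/6 helper: normalise to [0, 1], keep (lo, hi) for de-normalisation
    lo, hi = a.min(), a.max()
    return (a - lo) / (hi - lo), lo, hi

# Step 1: source provinces (plenty of data) and a small target province.
# Each province: a matrix of 7 site attributes and a per-site service count.
sources = [(rng.normal(size=(50, 7)), rng.uniform(10, 100, 50)) for _ in range(3)]
target_X, target_y = rng.normal(size=(8, 7)), rng.uniform(10, 100, 8)

target_y_n, lo, hi = min_max(target_y)

# Steps 2-5 would fit weight-based SVR models inside a multi-source
# TrAdaBoost loop; here a constant mean predictor stands in.
pred_n = np.full(len(target_y_n), target_y_n.mean())

# Step 6: de-normalise predictions and flag sites whose recorded service
# count deviates strongly from the prediction (candidate data errors).
pred = pred_n * (hi - lo) + lo
anomalous = np.abs(pred - target_y) > 0.5 * (hi - lo)
print(pred.shape, int(anomalous.sum()))
```

The point of the sketch is only the data flow: normalise, train on merged source plus target data, predict, de-normalise, compare against recorded counts.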
Preferably, the attribute of a site in step 1, i.e. its feature vector, is:

$x_m^{S_k} = (a_{m,1}^{S_k}, a_{m,2}^{S_k}, a_{m,3}^{S_k}, a_{m,4}^{S_k}, a_{m,5}^{S_k}, a_{m,6}^{S_k}, a_{m,7}^{S_k})$

where $x_m^{S_k}$ is the site attribute vector of site $m$ in province $S_k$, $k \in [1, N]$, $N$ is the number of provinces, $n^{S_k}$ is the number of sites in province $S_k$, and the components are: $a_{m,1}^{S_k}$ the site type, $a_{m,2}^{S_k}$ the site voltage class, $a_{m,3}^{S_k}$ the site dispatch grade, $a_{m,4}^{S_k}$ the site construction year, $a_{m,5}^{S_k}$ the number of optical transmission devices in the site, $a_{m,6}^{S_k}$ the system the site belongs to, and $a_{m,7}^{S_k}$ the centrality of site $m$ in province $S_k$;
The site type, voltage class, dispatch grade, construction year, number of optical transmission devices, and owning system can be obtained from the system data tables. The centrality of site $m$ in province $S_k$ is first initialized from the site's degree and the number of sites:

$C_m^{S_k,(0)} = \frac{d_m^{S_k}}{\sum_{j=1}^{n^{S_k}} d_j^{S_k}}$

where $C_m^{S_k}$ is the centrality of site $m$ in province $S_k$, $n^{S_k}$ the number of sites in province $S_k$, and $d_m^{S_k}$ the degree of site $m$. The centrality is then updated iteratively with the PageRank rule until it stabilizes:

$C_m^{S_k,(iter+1)} = \frac{1-\alpha}{n^{S_k}} + \alpha \sum_{j \in B_m^{S_k}} \frac{C_j^{S_k,(iter)}}{L_j^{S_k}}$

where $iter$ is the PageRank iteration index, $NI = 500$ is the total number of PageRank iterations, $C_m^{S_k,(iter)}$ is the centrality of site $m$ in province $S_k$ at iteration $iter$, $B_m^{S_k}$ is the set of sites of province $S_k$ with an optical cable connection to site $m$, $C_j^{S_k,(iter)}$ is the centrality of the $j$-th connected site, $L_j^{S_k}$ is the number of outgoing optical cables of site $j$, and $\alpha$ is the damping coefficient;
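The damped iteration above can be sketched directly over a fibre-link adjacency matrix. The exact patent formula appears only as a lost image in the source, so this uses the standard PageRank update under that assumption; the uniform initialisation and the 4-site example graph are illustrative.

```python
import numpy as np

def site_centrality(adj, alpha=0.85, iters=500):
    """PageRank-style centrality; adj[j, m] = 1 if site j has a fibre link to m."""
    n = len(adj)
    c = np.full(n, 1.0 / n)          # uniform initialisation (assumption)
    out_deg = adj.sum(axis=1)        # L_j: outgoing fibre links of site j
    for _ in range(iters):
        new = np.full(n, (1 - alpha) / n)
        for m in range(n):
            for j in range(n):
                if adj[j, m]:
                    # site j's centrality, split across its outgoing links
                    new[m] += alpha * c[j] / out_deg[j]
        if np.allclose(new, c, atol=1e-12):
            break
        c = new
    return c

# Toy 4-site network: a chain with extra links back into site 0.
adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 0]])
cent = site_centrality(adj)
print(cent.round(3))
```

Since every site here has at least one outgoing link, the centralities stay a probability distribution, which makes them comparable across provinces of different sizes.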
A source data set is built from the site attributes of each province with relatively large data volume:

$D = \{D^{S_1}, D^{S_2}, \ldots, D^{S_N}\}$

where $N$ is the number of provinces with relatively large data volume and $D^{S_k}$ is the $k$-th source data set, i.e. province $S_k$, containing $n^{S_k}$ samples, i.e. $n^{S_k}$ sites:

$D^{S_k} = \{x_1^{S_k}, x_2^{S_k}, \ldots, x_{n^{S_k}}^{S_k}\}$

where $n^{S_k}$ is the number of sites, i.e. samples, in province $S_k$, and $x_m^{S_k}$ is the site attribute vector of site $m$ in province $S_k$ with the seven components defined above: site type, voltage class, dispatch grade, construction year, number of optical transmission devices, owning system, and centrality;
A target training set is built from the site attributes of the province $S_T$ to be predicted:

$D_T = \{x_1^{S_T}, x_2^{S_T}, \ldots, x_{n_T}^{S_T}\}$

where $n_T$ is the number of samples of the target training set, i.e. the number of sites in the predicted province $S_T$, and $x_i^{S_T}$ ($i \in [1, n_T]$) is the site attribute vector, i.e. feature vector, of site $i$, with the same seven components: the site type, voltage class, dispatch grade, construction year, number of optical transmission devices, owning system, and centrality of site $i$ in province $S_T$;
The source data set $D$ and the target training set $D_T$ are discretized and normalized to obtain the normalized source data set $\hat D$ and the normalized target training set $\hat D_T$.

The site service counts corresponding to each province $S_k$ in the source data set $D$ are counted to obtain the service-count data set:

$Y = \{Y^{S_1}, \ldots, Y^{S_N}\}, \qquad Y^{S_k} = \{y_1^{S_k}, \ldots, y_{n^{S_k}}^{S_k}\}$

where $k \in [1, N]$ and $n^{S_k}$ is the number of sites in province $S_k$. The site service counts corresponding to the target training set $D_T$, i.e. province $S_T$, are counted to obtain the target service-count training set:

$Y_T = \{y_1^{S_T}, \ldots, y_{n_T}^{S_T}\}$

where $n_T$ is the number of sites in province $S_T$. The service-count data set $Y$ and the target service-count training set $Y_T$ are normalized with min-max standardization:

$y' = \frac{y - \min}{\max - \min}$

where $\min$ and $\max$ take the minimum and maximum of the set and $y$ is any site service count in $Y$ or $Y_T$; after min-max normalization we obtain the normalized service-count data set $\hat Y$ and the normalized target service-count training set $\hat Y_T$.
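The min-max normalization above and its inverse (used again in step 6) are simple to state in code; the function names here are illustrative.

```python
import numpy as np

def min_max_normalize(y):
    """y' = (y - min) / (max - min); returns (lo, hi) for later restoration."""
    lo, hi = y.min(), y.max()
    return (y - lo) / (hi - lo), lo, hi

def min_max_restore(y_norm, lo, hi):
    """Inverse transform used in step 6: y = y' * (max - min) + min."""
    return y_norm * (hi - lo) + lo

counts = np.array([120.0, 45.0, 300.0, 45.0, 210.0])  # toy service counts
norm, lo, hi = min_max_normalize(counts)
restored = min_max_restore(norm, lo, hi)
print(norm.min(), norm.max())
```

Keeping `(lo, hi)` alongside the normalized values is what makes the de-normalization of the predicted counts in step 6 possible.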
Preferably, for the weight-based SVR model of step 2, the normalized source data set of step 1 is:

$\hat D = \{\hat D^{S_1}, \ldots, \hat D^{S_N}\}$

where the $k$-th normalized source data set $\hat D^{S_k}$, i.e. province $S_k$, contains $n^{S_k}$ samples, i.e. $n^{S_k}$ sites. From the normalized source data set a training data set is built:

$T^{S_k} = \{(\hat x_i^{S_k}, \hat y_i^{S_k})\}_{i=1}^{n^{S_k}}$

where $N$ is the number of provinces, $n^{S_k}$ is the number of sites of province $S_k$, i.e. the size of the training data set $T^{S_k}$, $\hat y_i^{S_k}$ is the normalized service count of site $i$ of province $S_k$, and $\hat x_i^{S_k}$ is the normalized site attribute vector, i.e. normalized feature vector, of site $i$, with the seven normalized components defined above: site type, voltage class, dispatch grade, construction year, number of optical transmission devices, owning system, and centrality.

Each sample, i.e. each site, of the $k$-th normalized source data set $\hat D^{S_k}$ carries a normalized attribute weight $w_i^{S_k}$. The weight-based w-SVR model is:

$f(\hat x) = q \cdot \varphi(\hat x) + b$

where $q$ is the weight parameter of the model, $b$ is the bias parameter of the model, and $\varphi$ is the feature mapping induced by the kernel;
The parameters of the weight-based w-SVR model are solved as follows. The linear $\varepsilon$-insensitive loss function is defined as:

$L_\varepsilon\big(\hat y_i^{S_k}, f(\hat x_i^{S_k})\big) = \max\big(0,\ |\hat y_i^{S_k} - f(\hat x_i^{S_k})| - \varepsilon\big)$

where $\varepsilon$ is the insensitive loss value: when the difference between the normalized service count $\hat y_i^{S_k}$ of site $i$ of province $S_k$ and the predicted value $f(\hat x_i^{S_k})$ of the regression estimate function is less than $\varepsilon$, the loss equals 0.

The present invention selects the radial basis kernel function to transform the training data set $T^{S_k}$ non-linearly into another feature space, constructs the regression estimate function in the feature space after the radial basis kernel transformation, and initializes the weights $w_i^{S_k}$ of the $k$-th normalized source data set $\hat D^{S_k}$. The radial basis kernel function is:

$K(\hat x_i, \hat x_j) = \exp\left(-\frac{\|\hat x_i - \hat x_j\|^2}{2\sigma^2}\right)$

where $\sigma^2$ is the variance of the training data set $T^{S_k}$;
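The radial basis kernel with the bandwidth taken from the training set's variance can be computed as below. The exact scaling in the patent's formula is lost with the image, so the usual $\exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ form is assumed.

```python
import numpy as np

def rbf_kernel(X, sigma2):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-sq / (2.0 * sigma2))

# Toy normalized feature vectors; sigma^2 taken as the data variance,
# mirroring the definition above.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, sigma2=X.var())
print(K.shape)
```

The resulting Gram matrix is symmetric with a unit diagonal, which is what the dual optimization of the w-SVR below consumes.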
A weight coefficient is introduced into the SVR model to control the influence of singular variance, giving the optimization objective:

$\min_{q,b,\xi,\xi'}\ \frac{1}{2}\|q\|^2 + C\sum_i w_i^{S_k}(\xi_i + \xi_i')$

subject to $\hat y_i^{S_k} - q\cdot\varphi(\hat x_i^{S_k}) - b \le \varepsilon + \xi_i$, $\ q\cdot\varphi(\hat x_i^{S_k}) + b - \hat y_i^{S_k} \le \varepsilon + \xi_i'$, $\ \xi_i, \xi_i' \ge 0$,

where $\xi_i$ is the first slack variable, $\xi_i'$ the second slack variable, $\varepsilon$ the insensitive loss value, $C$ the model parameter, $q$ the weight parameter of the model, and $b$ the bias parameter of the model. By the Lagrangian transformation and duality, the optimization problem is converted to:

$\max_{\alpha,\alpha'}\ -\frac{1}{2}\sum_{i,j}(\alpha_i - \alpha_i')(\alpha_j - \alpha_j')K(\hat x_i, \hat x_j) - \varepsilon\sum_i(\alpha_i + \alpha_i') + \sum_i \hat y_i^{S_k}(\alpha_i - \alpha_i')$

subject to $\sum_i(\alpha_i - \alpha_i') = 0$ and $0 \le \alpha_i, \alpha_i' \le C\,w_i^{S_k}$, where $\alpha_i$ is the first Lagrange multiplier and $\alpha_i'$ the second. The values $\alpha_i, \alpha_i'$ are solved subject to the KKT conditions, from which the model weight parameter $q$ and bias parameter $b$ follow:

$q = \sum_i (\alpha_i - \alpha_i')\,\varphi(\hat x_i^{S_k}), \qquad b = \hat y_j^{S_k} - \sum_i(\alpha_i - \alpha_i')K(\hat x_i, \hat x_j) + \varepsilon \quad \text{for some } 0 < \alpha_j < C\,w_j^{S_k}$

finally giving the regression prediction model:

$f(\hat x) = \sum_i (\alpha_i - \alpha_i')\,K(\hat x_i, \hat x) + b$
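The per-sample weights $w_i^{S_k}$ above scale the slack penalty $C\,w_i$, which is exactly what scikit-learn's `SVR` exposes through `sample_weight`; it can therefore serve as a stand-in sketch for the patent's custom w-SVR. The data and weight values here are synthetic.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(60, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=60)  # toy regression target

# Per-site weights: e.g. the first half deemed more relevant to the target
# province, so their slack is penalised more strongly.
w = np.ones(60)
w[:30] = 5.0

model = SVR(kernel="rbf", C=10.0, epsilon=0.05)
model.fit(X, y, sample_weight=w)
pred = model.predict(X)
print(pred.shape)
```

In the method above, one such weighted SVR would be fitted per merged training set $D_k, Y_k$ in every boosting iteration, with the weights supplied by the TrAdaBoost updates of step 4.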
Preferably, in step 3 the site weight of each province $S_k$ in the source data $D$ is initialized as:

$w_i^{S_k} = \frac{1}{\sum_{k=1}^{N} n^{S_k} + n_T}$

where $n_T$ is the number of sites of the target province. The site weights of province $S_k$ are normalized to obtain the normalized weight vector $\hat w^{S_k}$ of the source-data province $S_k$; in the weighted multi-source TrAdaBoost algorithm the site weights of the source data are $\hat w^{S_1}, \ldots, \hat w^{S_N}$. The site weights of the target province $S_T$ are normalized to obtain the normalized weight vector $\hat w^{S_T}$ of the target province; in the weighted multi-source TrAdaBoost algorithm the site weights of the target province are $\hat w^{S_T}$.

The merged training data sets are:

$D_k = \hat D^{S_k} \cup \hat D_T, \qquad Y_k = \hat Y^{S_k} \cup \hat Y_T, \qquad k \in [1, N]$

where $\hat D^{S_k}$ is the normalized source data of province $S_k$ in the normalized source data set of step 1, each element being the normalized attribute vector of one of the $n^{S_k}$ sites of province $S_k$, and $N$ is the number of provinces; $\hat Y^{S_k}$ is the normalized service counts of province $S_k$ in the normalized service-count data set of step 1, each element being the normalized service count of a site; $\hat D_T$ is the normalized target training set of step 1, each element being the normalized attribute vector of one of the $n_T$ sites of the target province $S_T$; and $\hat Y_T$ is the normalized target service-count training set of step 1, each element being the normalized service count of a site of the target province;
Preferably, in step 4 the merged training data sets $D_k, Y_k$, the source-data site weights $\hat w^{S_k}$, and the target-province site weights $\hat w^{S_T}$ of the weighted multi-source TrAdaBoost algorithm are passed to step 2 to build the set of weight-based SVR models:

$h_t^k(\hat x) = \sum_i (\alpha_{t,i}^{S_k} - \alpha_{t,i}'^{S_k})\,K(\hat x_i^{S_k}, \hat x) + b_t^{S_k}, \qquad k \in [1, N]$

where $h_t^k$ is the $k$-th weight-based SVR model, built on province $S_k$, in the $t$-th iteration, $N$ is the number of source data sets, i.e. the number of provinces, $\alpha_{t,i}^{S_k}$ and $\alpha_{t,i}'^{S_k}$ are the first and second Lagrange multipliers of site $i$ of province $S_k$ in the $t$-th iteration, $b_t^{S_k}$ is the bias parameter, and $K(\hat x_i^{S_k}, \cdot)$ is the radial basis kernel of site $i$ of province $S_k$.

The error of each prediction model $h_t^k$ on the normalized target training set $\hat D_T$ and the normalized target service-count training set $\hat Y_T$ in the $t$-th iteration is computed as:

$\epsilon_t^k = \sum_{i=1}^{n_T} \hat w_{t,i}^{S_T}\,\big|h_t^k(\hat x_i^{S_T}) - \hat y_i^{S_T}\big|$

where $\hat w_{t,i}^{S_T}$ is the normalized weight of site $i$ of the target province $S_T$ in the $t$-th iteration, $h_t^k(\hat x_i^{S_T})$ is the predicted service count of site $i$ of the target province, and $\hat y_i^{S_T}$ is the actual service count of site $i$. According to the errors $\epsilon_t^k$ the weights of the prediction models $h_t^k$ are updated, and finally the candidate prediction model $h_t$ of the $t$-th iteration is obtained as their weighted combination.

Meanwhile, the error of the candidate prediction model $h_t$ on the target data $D_T, Y_T$ is computed with the target-site weights $w_{t,i}$:

$\varepsilon_t = \sum_{i=1}^{n_T} w_{t,i}\,\big|h_t(\hat x_i^{S_T}) - \hat y_i^{S_T}\big|$

and the parameter $\phi_t$ used to update the sample weights is set as:

$\phi_t = \frac{\varepsilon_t}{1 - \varepsilon_t}$

where $\varepsilon_t$ is the error of the model obtained in the $t$-th iteration. The weights of the target data samples are updated as:

$w_{t+1,i}^{S_T} = w_{t,i}^{S_T}\,\phi_t^{-\left|h_t(\hat x_i^{S_T}) - \hat y_i^{S_T}\right|}$

where $w_{t,i}^{S_T}$ is the weight of site $i$ of the target province $S_T$ in the $t$-th iteration, $h_t(\hat x_i^{S_T})$ is the predicted service count, $\hat y_i^{S_T}$ the actual service count, $\varepsilon$ the insensitive loss value, and $n_T$ the number of sites of the target province. The weights of the source data samples of each region are updated as:

$w_{t+1,i}^{S_k} = w_{t,i}^{S_k}\,\beta^{\left|h_t(\hat x_i^{S_k}) - \hat y_i^{S_k}\right|}$

where $w_{t,i}^{S_k}$ is the weight of site $i$ of source province $S_k$ in the $t$-th iteration, $h_t(\hat x_i^{S_k})$ the predicted site service count of the $t$-th iteration, $\hat y_i^{S_k}$ the actual site service count, $\varepsilon$ the insensitive loss value, and $n^{S_k}$ the number of sites of the province. The parameter $\beta$ is:

$\beta = \frac{1}{1 + \sqrt{2\ln n / M}}$

where $M$ is the maximum number of iterations, $t \in [1, M]$ the current iteration number, and $n = \sum_{k=1}^{N} n^{S_k}$ is, per the source data of step 1, the total number of sites over all source provinces;
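The per-iteration weight updates above can be sketched as follows, in the spirit of TrAdaBoost for regression: target-province sites the candidate model predicts badly gain weight (via $\phi_t$), while source sites that disagree with the target concept lose weight (via $\beta$). The original formulas exist only as lost images, so the standard TrAdaBoost forms are assumed; all values below are synthetic.

```python
import numpy as np

def update_weights(w_src, w_tgt, err_src, err_tgt, eps_t, M):
    """One TrAdaBoost-style weight update; errors are normalised residuals in [0, 1]."""
    phi = eps_t / (1.0 - eps_t)                 # phi_t from the weighted target error
    n_src = len(w_src)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / M))
    w_tgt = w_tgt * phi ** (-np.abs(err_tgt))   # upweight hard target sites
    w_src = w_src * beta ** (np.abs(err_src))   # downweight unrelated source sites
    total = w_src.sum() + w_tgt.sum()
    return w_src / total, w_tgt / total         # renormalise to a distribution

w_src = np.full(6, 1.0 / 9)
w_tgt = np.full(3, 1.0 / 9)
err_src = np.array([0.0, 0.1, 0.8, 0.0, 0.5, 0.9])
err_tgt = np.array([0.6, 0.0, 0.2])
w_src2, w_tgt2 = update_weights(w_src, w_tgt, err_src, err_tgt, eps_t=0.3, M=20)
print(round(w_src2.sum() + w_tgt2.sum(), 6))
```

After the update, a badly predicted target site (residual 0.6) outweighs a perfectly predicted one, and a badly fitting source site is demoted relative to a fitting one, which is the intended transfer behaviour.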
Preferably, in step 5, step 4 is repeated until the maximum number of iterations, and when $t = M$ the final prediction model $f(\hat x)$ is computed as:

$f(\hat x) = \frac{\sum_{t=\lceil M/2\rceil}^{M} \ln(1/\phi_t)\,h_t(\hat x)}{\sum_{t=\lceil M/2\rceil}^{M} \ln(1/\phi_t)}$

where $\phi_t$ is the parameter value generated in each iteration and $h_t(\hat x)$ is the model generated in each iteration;
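Combining the iteration models $h_t$ into the final predictor can be sketched as below, using the AdaBoost-style convention of averaging the later-half models with coefficients $\ln(1/\phi_t)$. The original combination formula is a lost image, so this exact form is an assumption; the constant toy models stand in for trained SVRs.

```python
import numpy as np

def final_model(models, phis, use_from=None):
    """Weighted average of the later-half iteration models with ln(1/phi_t) coefficients."""
    t0 = len(models) // 2 if use_from is None else use_from
    coefs = np.log(1.0 / np.asarray(phis[t0:]))
    def f(x):
        preds = np.array([h(x) for h in models[t0:]])
        return (coefs[:, None] * preds).sum(0) / coefs.sum()
    return f

# Toy weak models: constant predictors; phi_t shrinks as iterations improve.
models = [lambda x, c=c: np.full(len(x), c) for c in (0.2, 0.4, 0.5, 0.5)]
phis = [0.9, 0.6, 0.3, 0.3]
f = final_model(models, phis)
out = f(np.zeros((5, 1)))
print(out)
```

Restricting the ensemble to the later iterations discards the early, poorly weighted models, while $\ln(1/\phi_t)$ gives more say to iterations with lower target error.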
Preferably, in step 6, for the site attribute vector, i.e. feature vector, $\hat x_i^{S_T}$ of site $i$ of the target province $S_T$, the model prediction is $\hat y_i' = f(\hat x_i^{S_T})$, and the predicted value is de-normalized by:

$\hat y_i = \hat y_i'\,(\max - \min) + \min$

where $\min$ and $\max$ take the minimum and maximum of the set.
Compared with the prior art, the present invention saves data resources and improves data quality.
Description of the drawings
Fig. 1: Flow chart of the method of the present invention.
Specific implementation mode
To make the present invention easy for those of ordinary skill in the art to understand and implement, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the implementation examples described here serve only to illustrate and explain the present invention, not to limit it.
The embodiment of the present invention is introduced with reference to Fig. 1. The present invention provides a data verification method based on multi-source transfer learning, with the following specific steps:
Step 1: Obtain from the system data tables the site type, site voltage class, site dispatch grade, site construction year, number of optical transmission devices in the site, and the system the site belongs to, together with the site centrality computed by the PageRank algorithm, to form the site attributes; build a source data set from the site attributes of each site in each province and normalize it; build a target training set from the site attributes of the province to be predicted and normalize it; extract the site service counts corresponding to the source data set and the target training set and normalize them;
The attribute of a site in step 1, i.e. its feature vector, is:

$x_m^{S_k} = (a_{m,1}^{S_k}, \ldots, a_{m,7}^{S_k})$

where $x_m^{S_k}$ is the site attribute vector of site $m$ in province $S_k$, $N = 10$ is the number of provinces, $n^{S_k}$ is the number of sites in province $S_k$, and the seven components are the site type, voltage class, dispatch grade, construction year, number of optical transmission devices, owning system, and centrality of site $m$ in province $S_k$;
The site type, voltage class, dispatch grade, construction year, number of optical transmission devices, and owning system can be obtained from the system data tables. The centrality of site $m$ in province $S_k$ is first initialized from the site's degree and the number of sites:

$C_m^{S_k,(0)} = \frac{d_m^{S_k}}{\sum_{j=1}^{n^{S_k}} d_j^{S_k}}$

where $C_m^{S_k}$ is the centrality of site $m$, $n^{S_k}$ the number of sites in province $S_k$, and $d_m^{S_k}$ the degree of site $m$. The centrality is then updated iteratively with the PageRank rule until it stabilizes:

$C_m^{S_k,(iter+1)} = \frac{1-\alpha}{n^{S_k}} + \alpha \sum_{j \in B_m^{S_k}} \frac{C_j^{S_k,(iter)}}{L_j^{S_k}}$

where $iter$ is the PageRank iteration index, $NI = 500$ is the total number of PageRank iterations, $B_m^{S_k}$ is the set of sites of province $S_k$ with an optical cable connection to site $m$, $C_j^{S_k,(iter)}$ is the centrality of the $j$-th connected site, $L_j^{S_k}$ is the number of outgoing optical cables of site $j$, and $\alpha = 0.85$ is the damping coefficient;
A source data set is built from the site attributes of each province with relatively large data volume:

$D = \{D^{S_1}, \ldots, D^{S_N}\}$

where $N = 10$ is the number of provinces with relatively large data volume and $D^{S_k}$ is the $k$-th source data set, i.e. province $S_k$, containing $n^{S_k}$ samples, i.e. $n^{S_k}$ sites, each sample being the seven-component site attribute vector $x_m^{S_k}$ defined above (site type, voltage class, dispatch grade, construction year, number of optical transmission devices, owning system, and centrality);
A target training set is built from the site attributes of the province $S_T$ to be predicted:

$D_T = \{x_1^{S_T}, \ldots, x_{n_T}^{S_T}\}$

where $n_T$ is the number of samples of the target training set, i.e. the number of sites in the predicted province $S_T$, and $x_i^{S_T}$ ($i \in [1, n_T]$) is the site attribute vector, i.e. feature vector, of site $i$, with the same seven components: the site type, voltage class, dispatch grade, construction year, number of optical transmission devices, owning system, and centrality of site $i$ in province $S_T$;
The source data set $D$ and the target training set $D_T$ are discretized and normalized to obtain the normalized source data set $\hat D$ and the normalized target training set $\hat D_T$. The site service counts corresponding to each province $S_k$ in $D$ are counted to obtain the service-count data set $Y$, and the site service counts corresponding to $D_T$, i.e. province $S_T$, are counted to obtain the target service-count training set $Y_T$. Both are normalized with min-max standardization:

$y' = \frac{y - \min}{\max - \min}$

where $\min$ and $\max$ take the minimum and maximum of the set and $y$ is any site service count in $Y$ or $Y_T$; this yields the normalized service-count data set $\hat Y$ and the normalized target service-count training set $\hat Y_T$.
Step 2: Build a weight-based SVR model from the SVR transfer learning model and a radial basis kernel function;
For the weight-based SVR model of step 2, the normalized source data set of step 1 is $\hat D = \{\hat D^{S_1}, \ldots, \hat D^{S_N}\}$, where the $k$-th normalized source data set $\hat D^{S_k}$, i.e. province $S_k$, contains $n^{S_k}$ samples, i.e. $n^{S_k}$ sites. From the normalized source data set a training data set is built:

$T^{S_k} = \{(\hat x_i^{S_k}, \hat y_i^{S_k})\}_{i=1}^{n^{S_k}}$

where $n^{S_k}$ is the number of sites of province $S_k$, i.e. the size of $T^{S_k}$, $\hat y_i^{S_k}$ is the normalized service count of site $i$, and $\hat x_i^{S_k}$ is the normalized seven-component site attribute vector, i.e. normalized feature vector, of site $i$. Each sample, i.e. each site, carries a normalized attribute weight $w_i^{S_k}$. The weight-based w-SVR model is:

$f(\hat x) = q \cdot \varphi(\hat x) + b$

where $q$ is the weight parameter of the model, $b$ the bias parameter, and $\varphi$ the feature mapping induced by the kernel;
The parametric solution process of w-SVR models based on weight is:
Defining linear ε insensitive loss function is:
Wherein, ε=1/e is insensitive loss value, as province SkThe normalization number of services of website iAnd regression estimates
The predicted value of functionBetween difference be less than ε, loss be equal to 0;
The present invention selects the radial basis kernel function, which nonlinearly maps the training data set into another feature space; the regression estimation function is then constructed in the transformed feature space, and the weights of the Sk-th normalized source data are initialized. The radial basis kernel function formula is:
where σ² is the variance of the training data set;
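A hedged Python sketch of the radial basis kernel with σ² taken as the training-set variance (one common parameterization; the patent's exact formula is given only as an image) could look like:

```python
import numpy as np

def rbf_kernel(x, z, sigma2):
    """Radial basis kernel K(x, z) = exp(-||x - z||^2 / (2 * sigma2)),
    with sigma2 taken as the variance of the training data set."""
    x, z = np.asarray(x, dtype=float), np.asarray(z, dtype=float)
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma2)))

# sigma2 estimated from the (flattened) training features, as described above.
X_train = np.array([[0.1, 0.2], [0.4, 0.6], [0.9, 0.8]])
sigma2 = float(X_train.var())
k = rbf_kernel(X_train[0], X_train[1], sigma2)
```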
A weight coefficient is introduced into the SVR model to control the influence of singular variance, giving the optimization objective:
where ξi is the first slack variable, ξ'i is the second slack variable, ε = 1/e is the insensitive loss value, C is the model penalty parameter, q is the weight parameter of the model, and b is the bias parameter of the model; applying the Lagrangian and the dual transformation, the optimization problem becomes:
where αi is the first Lagrange multiplier and α'i is the second Lagrange multiplier; the solution for αi and α'i must satisfy the KKT conditions, so that:
the model weight parameter q and the bias parameter b are found:
where 0 < αi < C; finally, the regression prediction model is obtained:
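Since the patent's dual solver is given only as images, a functional stand-in for the weight-based regression is sketched below: a weighted kernel ridge model implemented with NumPy only. This is not the patent's exact w-SVR, merely an illustration of fitting a per-sample-weighted RBF-kernel regressor; all names and data are invented:

```python
import numpy as np

def rbf_gram(X, Z, sigma2):
    """Pairwise RBF kernel matrix between rows of X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def fit_weighted_kernel_model(X, y, w, sigma2, lam=1e-3):
    """Weighted kernel ridge stand-in for the weight-based SVR:
    minimizes sum_i w_i * (y_i - f(x_i))^2 + lam * ||f||^2."""
    K = rbf_gram(X, X, sigma2)
    W = np.diag(w)
    # Solve (W K + lam I) alpha = W y for the dual coefficients alpha.
    return np.linalg.solve(W @ K + lam * np.eye(len(X)), W @ y)

def predict(X_train, alpha, X_new, sigma2):
    return rbf_gram(X_new, X_train, sigma2) @ alpha

rng = np.random.default_rng(0)
X = rng.random((20, 7))            # 7 website attributes, as in Step 1
y = X.mean(axis=1)                 # synthetic normalized targets
w = np.full(len(X), 1.0 / len(X))  # uniform initial website weights
sigma2 = float(X.var())
alpha = fit_weighted_kernel_model(X, y, w, sigma2)
yhat = predict(X, alpha, X, sigma2)
```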
Step 3: Initialize the weights of the source data and of each website in the target province and normalize them; initialize the website weights of the source data and of the target province in the weighted multi-source TrAdaBoost algorithm; and obtain the merged training set by merging the normalized source data set with the normalized target training data set, and the normalized number-of-services training set with the normalized target number of services;
The website weights of each province Sk in the source data D described in Step 3 are initialized as:
where nT is the number of websites in the target province;
the website weights of province Sk are normalized to obtain the normalized weight vector of province Sk in the source data; the website weights of the source data in the weighted multi-source TrAdaBoost algorithm are:
the website weights of the target province ST are normalized to obtain the normalized weight vector of the target province; the website weights of the target province in the weighted multi-source TrAdaBoost algorithm are:
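The weight initialization and normalization can be illustrated as follows; the exact initialization formulas are images in the original, so this sketch assumes a uniform TrAdaBoost-style start followed by per-province normalization (all names are illustrative):

```python
import numpy as np

def init_normalized_weights(n_websites_per_province, n_target):
    """Illustrative TrAdaBoost-style initialization: every website starts
    with a uniform weight, then each province's weight vector and the
    target province's weight vector are normalized to sum to 1."""
    total = sum(n_websites_per_province) + n_target
    weights = {k: np.full(n, 1.0 / total)
               for k, n in enumerate(n_websites_per_province)}
    target_w = np.full(n_target, 1.0 / total)
    # Normalize each province's vector so it sums to 1.
    weights = {k: w / w.sum() for k, w in weights.items()}
    target_w = target_w / target_w.sum()
    return weights, target_w

# Two source provinces with 5 and 3 websites, target province with 4.
src_w, tgt_w = init_normalized_weights([5, 3], 4)
```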
The merged training data set is:
where the normalized source data of province Sk in the normalized source data set described in Step 1 is used, and N is the number of provinces:
where each element is a normalized website attribute, the size equals the number of websites of province Sk, and N is the number of provinces;
where the normalized number of services of province Sk in the normalized number-of-services data set described in Step 1 is used, and N = 10 is the number of provinces:
where each element is the normalized number of services of one website, the size equals the number of websites of province Sk, and N = 10 is the number of provinces;
where the normalized target training set described in Step 1 is:
where each element is a normalized attribute of the target province ST, and the size equals the number of websites of the target province ST;
where the normalized target number-of-services training set of the target province described in Step 1 is:
where each element is the normalized number of services of a website of the target province, and the size equals the number of websites of the target province;
Step 4: From the merged training set and the normalized weight vectors, establish the prediction model by Step 2 and compute the model error parameters;
As described in Step 4, the merged training data sets Dk, Yk, the source-data website weights in the weighted multi-source TrAdaBoost algorithm, and the target-province website weights are used to build, via Step 2, the set of weight-based SVR models:
where, in the t-th iteration, the k-th model is the weight-based SVR model of province Sk; N is the number of source data sets, i.e. the number of provinces; and the remaining terms are the first Lagrange multiplier, the second Lagrange multiplier, and the bias parameter of website i of province Sk in the t-th iteration, together with the radial basis kernel function of website i of province Sk;
The error of the prediction model on the normalized target training set and the normalized target number-of-services training set in the t-th iteration is computed:
where the normalized weight of website i of the target province ST in the t-th iteration, the predicted number of services of website i of the target province ST, and the actual number of services of website i of the target province ST are used; according to the error, the weight of the prediction model is updated:
Finally, the candidate prediction model ht of the t-th iteration is obtained:
The parameter φt for updating the sample weights is set:
where εt is the error of the model obtained in the t-th iteration; the weights of the target data samples are updated:
where the weight of website i of the target province ST in the t-th iteration, the predicted number of services of website i of the target province ST, and the actual number of services of website i of the target province ST are used; ε = 1/e is the insensitive loss value, and the denominator is the number of websites of the target province;
The weight of each source data sample of every region is updated:
where the weight of website i of source province Sk in the t-th iteration, the website number-of-services prediction obtained in the t-th iteration, and the actual website number of services are used; ε = 1/e is the insensitive loss value, and the denominator is the number of websites of the province; the update parameter is:
where M = 200 is the maximum number of iterations, t ∈ [1, M] is the current iteration number, and the total website count over all provinces in the source data of Step 1 is used;
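One iteration of the weight update can be sketched as below. Because the patent's formulas are images, the update follows the classic TrAdaBoost pattern (target weights scaled by φt^(−error), source weights by β^(error) with β = 1/(1 + sqrt(2 ln n / M))), which matches the quantities named in the text but is an assumption in its exact form:

```python
import math

def tradaboost_weight_update(src_w, src_err, tgt_w, tgt_err,
                             eps_t, n_src, M):
    """One TrAdaBoost-style weight update (illustrative sketch).

    Target websites with large error are up-weighted via phi_t; source
    websites with large error are down-weighted via beta.  Errors are
    assumed normalized to [0, 1] per sample.
    """
    phi_t = eps_t / (1.0 - eps_t)  # phi_t from the iteration error eps_t
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_src) / M))
    new_tgt = [w * phi_t ** (-e) for w, e in zip(tgt_w, tgt_err)]
    new_src = [w * beta ** e for w, e in zip(src_w, src_err)]
    return new_src, new_tgt

# Two source websites and two target websites, one of each mispredicted.
src_w, tgt_w = tradaboost_weight_update(
    [0.25, 0.25], [1.0, 0.0], [0.25, 0.25], [1.0, 0.0],
    eps_t=0.2, n_src=100, M=200)
```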
Step 5: Repeat Step 4 up to the maximum number of iterations and compute the final prediction model;
Repeating Step 4 up to the maximum number of iterations as described in Step 5, if t = M with M = 200, the final prediction model f(x) is computed:
where φt is the parameter value generated in each iteration and ht(x) is the model generated in each iteration;
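The combination of the per-iteration models ht with the parameters φt can be illustrated as a weighted median over the later half of the iterations, as in TrAdaBoost.R2; the patent states only that f(x) is computed from φt and ht, so the exact combination rule below is an assumption:

```python
import math

def final_prediction(models, phis, x):
    """Combine per-iteration models h_t into the final predictor f(x):
    weighted median of the later-half predictions with weights
    ln(1/phi_t), as in TrAdaBoost.R2 (illustrative assumption)."""
    half = len(models) // 2
    preds = [(h(x), math.log(1.0 / p))
             for h, p in zip(models[half:], phis[half:])]
    preds.sort(key=lambda t: t[0])
    total = sum(w for _, w in preds)
    acc = 0.0
    for value, w in preds:  # walk up to the weighted median
        acc += w
        if acc >= total / 2.0:
            return value
    return preds[-1][0]

# Four toy iteration models h_t(x) = c * x with decreasing phi_t.
models = [lambda x, c=c: c * x for c in (0.8, 0.9, 1.0, 1.1)]
f_x = final_prediction(models, [0.5, 0.4, 0.3, 0.2], 1.0)
```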
Step 6: Predict the website attributes of the target province with the final prediction model to obtain the predicted site service quantities of the target province, and denormalize the predicted site service quantities.
As described in Step 6, for the website attribute, i.e. feature vector, of website i of the target province ST:
the model prediction is obtained, and the denormalization operation is applied to the predicted value:
where min takes the minimum of the set and max takes the maximum of the set.
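The denormalization step is the inverse of min-max normalization and can be sketched as follows (names and values illustrative):

```python
def denormalize(y_norm, y_min, y_max):
    """Invert min-max normalization: y = y' * (max - min) + min, giving
    the predicted site service quantity on its original scale."""
    return y_norm * (y_max - y_min) + y_min

# A normalized prediction of 0.5 over an observed range [20, 80].
predicted = denormalize(0.5, 20, 80)
```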
It should be understood that the above description of the preferred embodiment is relatively detailed and should therefore not be regarded as limiting the scope of patent protection of the present invention; those skilled in the art may, under the inspiration of the present invention and without departing from the scope protected by the claims of the present invention, make substitutions or variations, all of which fall within the protection scope of the present invention; the claimed scope of the present invention is determined by the appended claims.
Claims (7)
1. A data verification method based on multi-source transfer learning, characterized by comprising the following steps:
Step 1: Obtain from the system data tables the type of site, website voltage class, website dispatching grade, website commissioning age, number of optical transmission devices in the website, and system to which the website belongs, together with the website centrality computed by the PageRank algorithm, to construct the website attributes; build the source data set from the website attributes of every website of each province and normalize it; build the target training set from the website attributes of the province to be predicted and normalize it; and extract the site service quantities corresponding to the source data set and the target training set and normalize them;
Step 2: Build the weight-based SVR model from the transfer learning SVR model and the radial basis function;
Step 3: Initialize the weights of the source data and of each website in the target province and normalize them; initialize the website weights of the source data and of the target province in the weighted multi-source TrAdaBoost algorithm; and obtain the merged training set by merging the normalized source data set with the normalized target training data set, and the normalized number-of-services training set with the normalized target number of services;
Step 4: From the merged training set and the normalized weight vectors, establish the prediction model by Step 2 and compute the model error parameters;
Step 5: Repeat Step 4 up to the maximum number of iterations and compute the final prediction model;
Step 6: Predict the website attributes of the target province with the final prediction model to obtain the predicted site service quantities of the target province, and denormalize the predicted site service quantities.
2. The data verification method based on multi-source transfer learning according to claim 1, characterized in that the website attribute, i.e. feature vector, described in Step 1 is:
where, for website m of province Sk with Sk ∈ [1, SN], N being the number of provinces and province Sk having a given number of websites, the components are: the type of site, the website voltage class, the website dispatching grade, the commissioning age of the website, the number of optical transmission devices in the website, the system to which the website belongs, and the website centrality;
The type of site, website voltage class, website dispatching grade, website commissioning age, number of optical transmission devices in the website, and system to which the website belongs can be obtained from the system data tables. The website centrality of website m of province Sk is first initialized from the degree of the website and the number of websites:
where the centrality of website m of province Sk is initialized from the number of websites of province Sk and the degree of website m; the centrality is then updated iteratively by the PageRank algorithm with the following formula until it becomes stable:
where iter is the iteration number of the PageRank algorithm, NI = 500 is the total number of PageRank iterations, the updated value is the centrality of website m of province Sk in the iter-th iteration, the sum runs over the set of websites of province Sk connected to website m by optical cable, each term uses the centrality of the j-th connected website divided by its number of outgoing optical cables, and α is the damping coefficient;
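For illustration, the centrality iteration described above can be sketched as a standard PageRank update over the optical-cable adjacency; the patent's exact initialization and update formulas are images in the original, so the form below is an assumption:

```python
def pagerank_centrality(adj, alpha=0.85, iters=500):
    """PageRank-style website centrality over the optical-cable graph.

    adj[m] lists the websites connected to m by optical cable (every
    node is assumed to have at least one connection).  Centrality is
    initialized from degree, then updated as
    c_m = (1 - alpha)/n + alpha * sum_j c_j / deg(j).
    """
    n = len(adj)
    total_deg = sum(len(v) for v in adj.values())
    c = {m: len(adj[m]) / total_deg for m in adj}  # degree-based init
    for _ in range(iters):
        c = {m: (1 - alpha) / n
                + alpha * sum(c[j] / len(adj[j]) for j in adj[m])
             for m in adj}
    return c

# Tiny 3-site optical-cable graph: site 0 connected to sites 1 and 2.
cent = pagerank_centrality({0: [1, 2], 1: [0], 2: [0]})
```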
The source data set is built from the website attributes of each province with a large data volume:
where N is the number of provinces with a large data volume, and the Sk-th source data, i.e. province Sk, contains a number of samples equal to its number of websites:
where the number of websites of province Sk is the number of samples; for website m of province Sk with Sk ∈ [1, SN], SN being the number of provinces, the components are: the type of site, the website voltage class, the website dispatching grade, the commissioning age of the website, the number of optical transmission devices in the website, the system to which the website belongs, and the website centrality;
The target training set is built from the website attributes of the predicted province ST:
where nT, the number of samples of the target training set, is the number of websites of the predicted province ST, and the website attribute, i.e. feature vector, of website i (i ∈ [1, nT]) of the predicted province ST is:
where the components are: the type of site, the website voltage class, the website dispatching grade, the commissioning age of the website, the number of optical transmission devices in the website, the system to which the website belongs, and the website centrality of website i of the predicted province ST;
The source data set D and the target training set are discretized and normalized, respectively, to obtain the normalized source data set and the normalized target training set;
The site service quantities corresponding to province Sk in the source data set D are counted to obtain the number-of-services data set:
where Sk ∈ [1, SN] and the count equals the number of websites of province Sk;
The site service quantities corresponding to the province ST in the target training set are counted to obtain the target number-of-services training set:
where the count equals the number of websites of the target province;
The number-of-services data set Y and the target number-of-services training set are normalized by min-max standardization:
where min takes the minimum of the set, max takes the maximum of the set, and y is the number of services of an arbitrary website in the data set Y or the target training set; after min-max normalization, the normalized number-of-services data set and the normalized target number-of-services training set are respectively obtained.
3. The data verification method based on multi-source transfer learning according to claim 1, characterized in that the weight-based SVR model described in Step 2 is built as follows: from Step 1, the normalized source data set is known:
the Sk-th normalized source data, i.e. province Sk, contains a number of samples equal to its number of websites:
A training data set is constructed from the normalized source data set:
where SN is the number of provinces, i.e. the number of samples; the number of websites of province Sk is the size of its training data set; each label is the normalized number of services of website i in province Sk; and the normalized website attribute, i.e. normalized feature vector, of website i in province Sk in the training data set is:
where the components for website m of province Sk are: the type of site, the website voltage class, the website dispatching grade, the commissioning age of the website, the number of optical transmission devices in the website, the system to which the website belongs, and the website centrality;
Each sample, i.e. each website, in the Sk-th normalized source data is assigned a normalized attribute weight; the weight-based w-SVR model is:
where q is the weight parameter of the model and b is the bias parameter of the model;
The parameters of the weight-based w-SVR model are solved as follows:
The linear ε-insensitive loss function is defined as:
where ε is the insensitive loss value; when the difference between the normalized number of services of website i in province Sk and the predicted value of the regression estimation function is less than ε, the loss equals 0;
The present invention selects the radial basis kernel function, which nonlinearly maps the training data set into another feature space; the regression estimation function is constructed in the transformed feature space, and the weights of the Sk-th normalized source data are initialized; the radial basis kernel function formula is:
where σ² is the variance of the training data set;
A weight coefficient is introduced into the SVR model to control the influence of singular variance, giving the optimization objective:
where ξi is the first slack variable, ξ'i is the second slack variable, ε is the insensitive loss value, C is the model penalty parameter, q is the weight parameter of the model, and b is the bias parameter of the model; applying the Lagrangian and the dual transformation, the optimization problem becomes:
where αi is the first Lagrange multiplier and α'i is the second Lagrange multiplier; the solution for αi and α'i must satisfy the KKT conditions, so that:
the model weight parameter q and the bias parameter b are found:
where 0 < αi < C; finally, the regression prediction model is obtained:
4. The data verification method based on multi-source transfer learning according to claim 1, characterized in that:
the website weights of each province Sk in the source data D described in Step 3 are initialized as:
where the sample count, i.e. the number of websites, of province Sk is used; the website weights of the target province ST are initialized as:
where nT is the number of websites in the target province;
the website weights of province Sk are normalized to obtain the normalized weight vector of province Sk in the source data; the website weights of the source data in the weighted multi-source TrAdaBoost algorithm are:
the website weights of the target province ST are normalized to obtain the normalized weight vector of the target province; the website weights of the target province in the weighted multi-source TrAdaBoost algorithm are:
The merged training data set is:
where the normalized source data of province Sk in the normalized source data set described in Step 1 is used, and N is the number of provinces:
where each element is a normalized website attribute, the size equals the number of websites of province Sk, and N is the number of provinces;
where the normalized number of services of province Sk in the normalized number-of-services data set described in Step 1 is used, and N is the number of provinces:
where each element is the normalized number of services of one website, the size equals the number of websites of province Sk, and N is the number of provinces;
where the normalized target training set described in Step 1 is:
where each element is a normalized attribute of the target province ST, and the size equals the number of websites of the target province ST;
where the normalized target number-of-services training set of the target province described in Step 1 is:
where each element is the normalized number of services of a website of the target province, and the size equals the number of websites of the target province.
5. The data verification method based on multi-source transfer learning according to claim 1, characterized in that, in Step 4, the merged training data sets Dk, Yk, the source-data website weights in the weighted multi-source TrAdaBoost algorithm, and the target-province website weights are used to build, via Step 2, the set of weight-based SVR models:
where, in the t-th iteration, the k-th model is the weight-based SVR model of province Sk; N is the number of source data sets, i.e. the number of provinces; and the remaining terms are the first Lagrange multiplier, the second Lagrange multiplier, and the bias parameter of website i of province Sk in the t-th iteration, together with the radial basis kernel function of website i of province Sk;
The error of the prediction model on the normalized target training set and the normalized target number-of-services training set in the t-th iteration is computed:
where the normalized weight of website i of the target province ST in the t-th iteration, the predicted number of services of website i of the target province ST, and the actual number of services of website i of the target province ST are used; according to the error, the weight of the prediction model is updated:
Finally, the candidate prediction model ht of the t-th iteration is obtained:
Meanwhile, the error of the candidate prediction model ht on the target test data DT, YT is computed, wt,i being the weight of the target-province data site:
The parameter φt for updating the sample weights is set:
where εt is the error of the model obtained in the t-th iteration; the weights of the target data samples are updated:
where the weight of website i of the target province ST in the t-th iteration, the predicted number of services of website i of the target province ST, and the actual number of services of website i of the target province ST are used; ε is the insensitive loss value, and the denominator is the number of websites of the target province;
The weight of each source data sample of every region is updated:
where the weight of website i of source province Sk in the t-th iteration, the website number-of-services prediction of the t-th iteration, and the actual website number of services are used; ε is the insensitive loss value, and the denominator is the number of websites of the province; the update parameter is:
where M is the maximum number of iterations, t ∈ [1, M] is the current iteration number, and the total website count over all provinces in the source data of Step 1 is used.
6. The data verification method based on multi-source transfer learning according to claim 1, characterized in that repeating Step 4 up to the maximum number of iterations and computing the final prediction model as described in Step 5 is:
if t = M, the final prediction model f(x) is computed:
where φt is the parameter value generated in each iteration and ht(x) is the model generated in each iteration.
7. The data verification method based on multi-source transfer learning according to claim 1, characterized in that, in Step 6, for the website attribute, i.e. feature vector, of website i of the target province ST:
the model prediction is obtained, and the denormalization operation is applied to the predicted value:
where min takes the minimum of the set and max takes the maximum of the set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810320808.6A CN108549907B (en) | 2018-04-11 | 2018-04-11 | Data verification method based on multi-source transfer learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108549907A true CN108549907A (en) | 2018-09-18 |
CN108549907B CN108549907B (en) | 2021-11-16 |
Family
ID=63514421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810320808.6A Active CN108549907B (en) | 2018-04-11 | 2018-04-11 | Data verification method based on multi-source transfer learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108549907B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20100069942A (en) * | 2008-12-17 | 2010-06-25 | 한양대학교 산학협력단 | Method for cooperative transmitting data in wireless multihop network and system thereof |
CN104199857A (en) * | 2014-08-14 | 2014-12-10 | 西安交通大学 | Tax document hierarchical classification method based on multi-tag classification |
CN106296044A (en) * | 2016-10-08 | 2017-01-04 | 南方电网科学研究院有限责任公司 | power system risk scheduling method and system |
CN106651188A (en) * | 2016-12-27 | 2017-05-10 | 贵州电网有限责任公司贵阳供电局 | Electric transmission and transformation device multi-source state assessment data processing method and application thereof |
CN107818523A (en) * | 2017-11-14 | 2018-03-20 | 国网江西省电力公司信息通信分公司 | Power communication system data true value based on unstable frequency distribution and frequency factor study differentiates and estimating method |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020168676A1 (en) * | 2019-02-21 | 2020-08-27 | 烽火通信科技股份有限公司 | Method for constructing network fault handling model, fault handling method and system |
CN110398986A (en) * | 2019-04-28 | 2019-11-01 | 清华大学 | A kind of intensive woods cognition technology of unmanned plane of multi-source data migration |
CN110457646A (en) * | 2019-06-26 | 2019-11-15 | 中国政法大学 | One kind being based on parameter transfer learning low-resource head-position difficult labor personalized method |
CN110457646B (en) * | 2019-06-26 | 2022-12-13 | 中国政法大学 | Low-resource head-related transfer function personalization method based on parameter migration learning |
CN110674648A (en) * | 2019-09-29 | 2020-01-10 | 厦门大学 | Neural network machine translation model based on iterative bidirectional migration |
CN110674648B (en) * | 2019-09-29 | 2021-04-27 | 厦门大学 | Neural network machine translation model based on iterative bidirectional migration |
CN112651173A (en) * | 2020-12-18 | 2021-04-13 | 浙江大学 | Agricultural product quality nondestructive testing method based on cross-domain spectral information and generalizable system |
CN112651173B (en) * | 2020-12-18 | 2022-04-29 | 浙江大学 | Agricultural product quality nondestructive testing method based on cross-domain spectral information and generalizable system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||