CN106778795A - Classification method and device based on incremental learning

Classification method and device based on incremental learning

Info

Publication number
CN106778795A
Authority
CN
China
Prior art keywords
data sample
forgetting factor
sample
classifier
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510824421.0A
Other languages
Chinese (zh)
Inventor
王堃
杨丽
王元钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Nanjing University of Posts and Telecommunications
Original Assignee
Huawei Technologies Co Ltd
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd and Nanjing University of Posts and Telecommunications
Priority to CN201510824421.0A
Publication of CN106778795A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the invention disclose a classification method based on incremental learning. The method includes: building a classifier and determining a classification feature vector; training newly added data samples according to the classifier and the classification feature vector; and deleting or retaining the data samples according to the relevant parameters obtained from training. Correspondingly, embodiments of the invention also disclose a classification device based on incremental learning. With the present invention, useless data can be actively deleted and useful data retained during incremental learning, improving the adaptability and accuracy of the cognitive computing model.

Description

Classification method and device based on incremental learning
Technical field
The present invention relates to the technical field of data processing, and in particular to a classification method and device based on incremental learning.
Background
The rapid development of emerging technologies such as cloud computing and the Internet of Things is driving the scale of data to grow at an unprecedented rate; the era of big data has arrived. Obtaining valuable information from massive, complex data in a timely and effective manner depends on a model that can learn actively, analyse data for valuable information on demand, and actively perform data computation and processing: the cognitive computing model. A cognitive computing model that can quickly and accurately find valuable information, extract it, and organize it provides an effective solution.
The support vector machine (SVM) classification algorithm, a widely applied cognitive computing model, uses batch processing. Because a batch algorithm must load the entire data set into memory, and memory capacity is limited, the algorithm is unsuited to big-data computation and cannot be applied to scenarios with high real-time requirements. Moreover, building the classifier takes a substantial amount of time, so data arriving later may be lost because it is not processed in time, with serious consequences.
Summary of the invention
Embodiments of the invention provide a classification method and device based on incremental learning, which can actively delete useless data and retain useful data during incremental learning, thereby improving the adaptability and accuracy of the cognitive computing model.
A first aspect of the embodiments of the invention provides a classification method based on incremental learning, including:
building a classifier and determining a classification feature vector;
training newly added data samples according to the classifier and the classification feature vector;
deleting or retaining the data samples according to the relevant parameters obtained from training.
In a first possible implementation of the first aspect, building a classifier and determining a classification feature vector includes:
determining the number of classifiers to be built;
building each classifier;
determining the classification feature vector of each classifier.
In a second possible implementation of the first aspect, training newly added data samples according to the classifier and the classification feature vector includes:
(1) randomly selecting a newly added data sample subset B_1 by the stochastic gradient descent (SGD) algorithm and training on it;
(2) judging, by the initial classifier Γ_1, whether the samples in B_1 are classified correctly, and dividing B_1 into a test error set B_err and a test correct set B_ok according to the judgment result;
(3) judging whether the test error set B_err is empty: if so, extracting a new batch of data samples by the SGD algorithm for training; if not, merging the set of support vectors (SVs) of the original data samples with the subset B_1 to obtain a new set and a new classifier Γ_2, and merging the data samples of the new set that are not support vectors with the test correct set B_ok to obtain the incremental data sample set B_1′ of the classifier Γ_2;
and repeating steps (1), (2) and (3).
With reference to the second possible implementation of the first aspect, in a third possible implementation, deleting or retaining the data samples according to the relevant parameters obtained from training includes:
obtaining the forgetting factor α_i according to formulas (1), (2) and (3), where α_i denotes the ratio of the number of times the i-th data sample has been a support vector (SV) to the total number of training rounds T_i after T rounds of training, and r_i denotes the number of times the i-th data sample has been an SV after training; r_i = 0 for each data sample in the test error set B_err, and r_i = 1 for each data sample in the test correct set B_ok;
and deleting or retaining the data samples according to a predictive incremental learning mechanism based on the forgetting factor α_i.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation, deleting or retaining the data samples according to the predictive incremental learning mechanism based on the forgetting factor α_i includes:
setting three thresholds β, γ and δ;
comparing the forgetting factor α_i with β, γ and δ;
deleting or retaining the data samples according to the comparison result.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation, after setting the three thresholds β, γ and δ, the method further includes:
after every 10 rounds of training, obtaining the error weight of each data sample with respect to the set threshold according to formula (4), where e_i denotes the error weight and P denotes the set threshold:
e_i = P - α_i (1 ≤ i ≤ 10)    (4)
selecting the forgetting factor α_i with the largest error weight as the new threshold;
adaptively adjusting the values of β, γ and δ according to the forgetting factor α_i.
With reference to the fourth possible implementation of the first aspect, in a sixth possible implementation, deleting or retaining the data samples according to the comparison result includes:
when the forgetting factor α_i = 0, retaining the data sample corresponding to α_i;
when 0 < α_i < β, deleting the data sample corresponding to α_i;
when β ≤ α_i < δ, selecting the data samples with forgetting factor greater than γ as the next test data sample set;
when δ < α_i < 1, using the data sample corresponding to α_i as the next test data sample set.
A second aspect of the embodiments of the invention provides a classification device based on incremental learning, including:
an initialization module, configured to build a classifier and determine a classification feature vector;
a data training module, configured to train newly added data samples according to the classifier and the classification feature vector;
a data processing module, configured to delete or retain the data samples according to the relevant parameters obtained from training.
In a first possible implementation of the second aspect, the initialization module is specifically configured to:
determine the number of classifiers to be built;
build each classifier;
determine the classification feature vector of each classifier.
In a second possible implementation of the second aspect, the data training module is specifically configured to:
(1) randomly select a newly added data sample subset B_1 by the stochastic gradient descent (SGD) algorithm and train on it;
(2) judge, by the initial classifier Γ_1, whether the samples in B_1 are classified correctly, and divide B_1 into a test error set B_err and a test correct set B_ok according to the judgment result;
(3) judge whether the test error set B_err is empty: if so, extract a new batch of data samples by the SGD algorithm and train on it; if not, merge the set of support vectors (SVs) of the original data samples with the subset B_1 to obtain a new set and a new classifier Γ_2, and merge the data samples of the new set that are not support vectors with the test correct set B_ok to obtain the incremental data sample set B_1′ of the classifier Γ_2;
and repeat steps (1), (2) and (3).
With reference to the second possible implementation of the second aspect, in a third possible implementation, the data processing module includes:
a parameter calculation unit, configured to obtain the forgetting factor α_i according to formulas (1), (2) and (3), where α_i denotes the ratio of the number of times the i-th data sample has been a support vector (SV) to the total number of training rounds T_i after T rounds of training, and r_i denotes the number of times the i-th data sample has been an SV after training; r_i = 0 for each data sample in the test error set B_err, and r_i = 1 for each data sample in the test correct set B_ok;
a data processing unit, configured to delete or retain the data samples according to a predictive incremental learning mechanism based on the forgetting factor α_i.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, the data processing unit is specifically configured to:
set three thresholds β, γ and δ;
compare the forgetting factor α_i with β, γ and δ;
delete or retain the data samples according to the comparison result.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, the data processing module also includes a threshold adjustment unit, configured to:
after every 10 rounds of training, obtain the error weight of each data sample with respect to the set threshold according to formula (4), where e_i denotes the error weight and P denotes the set threshold:
e_i = P - α_i (1 ≤ i ≤ 10)    (4)
select the forgetting factor α_i with the largest error weight as the new threshold;
adaptively adjust the values of β, γ and δ according to the forgetting factor α_i.
With reference to the fourth possible implementation of the second aspect, in a sixth possible implementation, the data processing unit is further specifically configured to:
when the forgetting factor α_i = 0, retain the data sample corresponding to α_i;
when 0 < α_i < β, delete the data sample corresponding to α_i;
when β ≤ α_i < δ, select the data samples with forgetting factor greater than γ as the next test data sample set;
when δ < α_i < 1, use the data sample corresponding to α_i as the next test data sample set.
It can be seen that embodiments of the invention first build a classifier and determine a classification feature vector, then train newly added data samples according to the classifier and the classification feature vector, and finally delete or retain the data samples according to the relevant parameters obtained from training. Useless data can thus be actively deleted and useful data retained during incremental learning, improving the adaptability and accuracy of the cognitive computing model.
Brief description of the drawings
To illustrate the embodiments of the invention more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a classification method based on incremental learning according to an embodiment of the invention;
Fig. 2 is a schematic flowchart of another classification method based on incremental learning according to an embodiment of the invention;
Fig. 3 is a schematic structural diagram of a classification device based on incremental learning according to an embodiment of the invention;
Fig. 4 is a schematic structural diagram of a data processing module according to an embodiment of the invention;
Fig. 5 is a schematic structural diagram of another classification device based on incremental learning according to an embodiment of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
The classification method based on incremental learning provided by the embodiments of the invention applies to learning and training algorithms for big data. Besides batch data processing fields such as biometric recognition, signal recognition and detection, and image recognition, it can also be applied to classification, regression and clustering of massive real-time data streams, for example real-time securities trading, economic data classification and analysis in e-commerce, real-time diagnosis in mobile health Internet-of-Things systems, and real-time traffic flow prediction in smart transportation.
Fig. 1 is a schematic flowchart of a classification method based on incremental learning according to an embodiment of the invention. As shown in the figure, the flow of the classification method based on incremental learning in this embodiment may include:
S101: build a classifier and determine a classification feature vector.
Specifically, the number of classifiers to be built is determined according to the amount of data samples to be processed; the corresponding number of classifiers is then built, and the classification feature vector of each classifier is determined. The training steps are as follows (an illustrative sketch follows the list):
(1) classify the features of all data sample sets with the classifier;
(2) compute the correlation value of each feature vector in each classifier according to the preselected kernel function;
(3) compute the covariance matrix space from the obtained correlation values and perform a Householder transformation on it;
(4) compute the classification feature coefficients;
(5) obtain the model parameters of the classifier.
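The patent text gives no code for steps (1) to (5); the Python sketch below is one plausible reading, in which correlation values come from a preselected RBF kernel, the covariance matrix is reduced with a Householder-based QR factorization, and the leading directions serve as classification feature coefficients. All names and parameter choices (build_classifier, gamma, n_components) are assumptions for illustration, not part of the original disclosure.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Step (2): correlation value between feature vectors under a
    # preselected kernel (the RBF choice is an assumption).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def build_classifier(X, n_components=2, gamma=0.5):
    K = rbf_kernel(X, X, gamma)      # step (2): kernel correlation values
    C = np.cov(K)                    # step (3): covariance matrix space
    Q, _ = np.linalg.qr(C)           # Householder transformation (QR is
                                     # computed via Householder reflections)
    coeffs = Q[:, :n_components]     # step (4): classification feature coefficients
    return {"train": X, "coeffs": coeffs, "gamma": gamma}  # step (5): model parameters

# Build one classifier; in practice the number of classifiers is chosen
# from the amount of data samples to be processed.
model = build_classifier(np.random.rand(20, 4))
```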
S102: train the newly added data samples according to the classifier and the classification feature vector.
Specifically, the steps of training the newly added data samples according to the classifier and the classification feature vector are as follows (a sketch follows the list):
(1) randomly select a newly added data sample subset B_1 by the stochastic gradient descent (SGD) algorithm and train on it;
(2) judge, by the initial classifier Γ_1, whether the samples in B_1 are classified correctly, and divide B_1 into a test error set B_err and a test correct set B_ok according to the judgment result;
(3) judge whether the test error set B_err is empty: if so, extract a new batch of data samples by the SGD algorithm and train on it; if not, merge the set of support vectors (SVs) of the original data samples with the subset B_1 to obtain a new set and a new classifier Γ_2, and merge the data samples of the new set that are not support vectors with the test correct set B_ok to obtain the incremental data sample set B_1′ of the classifier Γ_2;
repeat steps (1), (2) and (3).
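A minimal sketch of this training loop is given below, assuming a linear SVM trained by SGD (scikit-learn's SGDClassifier with hinge loss). SGDClassifier does not expose support vectors, so the SV set is approximated here by the samples inside the margin; all function and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def incremental_round(clf, sv_X, sv_y, X_new, y_new):
    """One pass of steps (1)-(3); the margin-based SV approximation
    is an assumption, not the patent's reference implementation."""
    pred = clf.predict(X_new)                    # step (2): test B_1 with Gamma_1
    ok = pred == y_new                           # split into B_ok / B_err
    if ok.all():                                 # step (3): B_err is empty
        return clf, sv_X, sv_y, (X_new, y_new)  # draw a fresh batch next round
    X_merge = np.vstack([sv_X, X_new])           # merge old SV set with B_1
    y_merge = np.concatenate([sv_y, y_new])
    clf.partial_fit(X_merge, y_merge)            # retrain: new classifier Gamma_2
    margin = np.abs(clf.decision_function(X_merge))
    sv = margin <= 1.0                           # samples inside the margin ~ SVs
    B1p_X = np.vstack([X_merge[~sv], X_new[ok]]) # B_1': non-SV samples plus B_ok
    B1p_y = np.concatenate([y_merge[~sv], y_new[ok]])
    return clf, X_merge[sv], y_merge[sv], (B1p_X, B1p_y)

rng = np.random.default_rng(0)
X0, y0 = rng.normal(size=(50, 4)), rng.integers(0, 2, 50)
clf = SGDClassifier(loss="hinge", random_state=0)
clf.partial_fit(X0, y0, classes=np.array([0, 1]))
sv_X, sv_y = X0, y0                              # seed the SV set with the first batch
X1, y1 = rng.normal(size=(10, 4)), rng.integers(0, 2, 10)
clf, sv_X, sv_y, B1_prime = incremental_round(clf, sv_X, sv_y, X1, y1)
```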
S103: delete or retain the data samples according to the relevant parameters obtained from training.
Specifically, the forgetting factor α_i is first obtained according to formulas (1), (2) and (3):
ω = Σ_i α_i y_i H(x_i)    (1)
0 ≤ α_i ≤ 1, Σ_i α_i y_i = 0    (2)
α_i = r_i / T_i    (3)
where α_i denotes the ratio of the number of times the i-th data sample has been a support vector (SV) to the total number of training rounds T_i after T rounds of training, and r_i denotes the number of times the i-th data sample has been an SV after training. For each data sample in the test error set B_err, r_i = 0, i.e. the sample is normal data; for each data sample in the test correct set B_ok, r_i = 1, i.e. the sample is abnormal data. It should be pointed out that, by the improved Karush-Kuhn-Tucker (KKT) theorem, formulas (1) and (2) are the conditions for the optimal hyperplane.
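As a quick illustration of formula (3), the sketch below tracks r_i and T_i across training rounds and computes α_i; the counter layout and names are assumptions for illustration.

```python
import numpy as np

def forgetting_factors(sv_history):
    """sv_history: (T, n) boolean array; sv_history[t, i] is True when the
    i-th sample was a support vector after training round t."""
    T = sv_history.shape[0]          # total number of training rounds T_i
    r = sv_history.sum(axis=0)       # r_i: number of times sample i was an SV
    return r / T                     # formula (3): alpha_i = r_i / T_i

# Example: 5 rounds, 3 samples; sample 1 is never an SV (e.g. it sits in B_err).
hist = np.array([[1, 0, 1],
                 [1, 0, 1],
                 [0, 0, 1],
                 [1, 0, 1],
                 [1, 0, 0]], dtype=bool)
print(forgetting_factors(hist))      # -> [0.8 0.  0.8]
```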
Then, the data samples are deleted or retained according to a predictive incremental learning mechanism based on the forgetting factor α_i. In a specific implementation, three thresholds β, γ and δ are first set, for example β = 0.3, γ = 0.4, δ = 0.7; the forgetting factor α_i is then compared with β, γ and δ, and the data samples are deleted or retained according to the comparison result by the following preset rules (a sketch follows the list):
When the forgetting factor α_i = 0, the data sample corresponding to α_i is retained. The reason is that the newly added data sample is an error sample; such samples do not affect the original classifier but may belong to a new class, so they are retained as a subset of the test samples for the next round of training.
When 0 < α_i < β, the data sample corresponding to α_i is deleted. The reason is that the sample's SV ratio after repeated training is very low and it does not represent a new class, so it is deleted, which reduces the storage of the original data while improving the training speed of intrusion detection.
When β ≤ α_i < δ, the data samples with forgetting factor greater than γ are selected as the next test data sample set, which can accelerate the convergence of the SV search.
When δ < α_i < 1, the data sample corresponding to α_i is used as the next test data sample set.
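A minimal sketch of these preset rules, assuming the example thresholds β = 0.3, γ = 0.4, δ = 0.7; the string action labels are illustrative names, and the fallback branch covers boundary values the rules leave unspecified.

```python
def apply_rule(alpha, beta=0.3, gamma=0.4, delta=0.7):
    """Map a forgetting factor alpha_i to an action under the preset rules."""
    if alpha == 0:
        return "retain"               # possible new class: keep for next test subset
    if 0 < alpha < beta:
        return "delete"               # rarely an SV: drop to save storage
    if beta <= alpha < delta:
        return "test" if alpha > gamma else "retain"  # > gamma: next test set
    if delta < alpha < 1:
        return "test"                 # strongly SV-like: next test set
    return "retain"                   # boundary values not covered by the rules

for a in (0.0, 0.1, 0.35, 0.5, 0.8):
    print(a, apply_rule(a))
```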
Further, optionally, after every 10 rounds of training, the error weight of each data sample with respect to the set threshold is obtained according to formula (4), where e_i denotes the error weight and P denotes the set threshold; the forgetting factor α_i with the largest error weight is selected as the new threshold, and the values of β, γ and δ are adaptively adjusted according to it (a sketch follows):
e_i = P - α_i (1 ≤ i ≤ 10)    (4)
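The following sketch illustrates formula (4) and the selection of the new threshold. Exactly how β, γ and δ are then rescaled is not specified in the text, so the proportional update shown is an assumption.

```python
import numpy as np

def adjust_thresholds(alphas, P, beta, gamma, delta):
    """alphas: forgetting factors from the last 10 training rounds."""
    e = P - alphas                    # formula (4): e_i = P - alpha_i
    P_new = alphas[np.argmax(e)]      # alpha_i with the largest error weight
    scale = P_new / P if P else 1.0   # assumed proportional adaptation
    return P_new, beta * scale, gamma * scale, delta * scale

alphas = np.array([0.2, 0.35, 0.5, 0.65, 0.4, 0.55, 0.3, 0.45, 0.6, 0.25])
print(adjust_thresholds(alphas, P=0.5, beta=0.3, gamma=0.4, delta=0.7))
# -> new P = 0.2 with beta/gamma/delta scaled by the assumed update
```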
It can be seen that this embodiment first builds a classifier and determines a classification feature vector, then trains the newly added data samples according to them, and finally deletes or retains the data samples according to the relevant parameters obtained from training, so that useless data is actively deleted and useful data retained during incremental learning, improving the adaptability and accuracy of the cognitive computing model.
Fig. 2 is a schematic flowchart of another classification method based on incremental learning according to an embodiment of the invention. The method applies to network information intrusion detection and can perform intrusion detection on massive network traffic data. As shown in the figure, the flow of the classification method based on incremental learning in this embodiment may include:
S201: build an attack classifier and determine a classification feature vector.
Specifically, the number of attack classifiers to be built is determined according to the amount of network traffic data samples to be processed; the corresponding number of attack classifiers is then built, and the classification feature vector of each attack classifier is determined. The training steps are as follows:
(1) classify the features of all network traffic data sample sets with the attack classifier;
(2) compute the correlation value of each feature vector in each attack classifier according to the preselected kernel function;
(3) compute the covariance matrix space from the obtained correlation values and perform a Householder transformation on it;
(4) compute the classification feature coefficients;
(5) obtain the model parameters of the attack classifier.
S202: set three thresholds β, γ and δ.
For example, β = 0.3, γ = 0.4, δ = 0.7.
S203: train the newly added network traffic data samples according to the attack classifier and the classification feature vector, and obtain the forgetting factor.
Specifically, the steps of training the newly added network traffic data samples according to the attack classifier and the classification feature vector are as follows:
(1) randomly select a newly added network traffic data sample subset B_1 by the stochastic gradient descent (SGD) algorithm and train on it;
(2) judge, by the initial attack classifier Γ_1, whether the samples in B_1 are classified correctly, and divide B_1 into a test error set B_err and a test correct set B_ok according to the judgment result;
(3) judge whether the test error set B_err is empty: if so, extract a new batch of network traffic data samples by the SGD algorithm and train on it; if not, merge the set of support vectors (SVs) of the original network traffic data samples with the subset B_1 to obtain a new set and a new attack classifier Γ_2, and merge the network traffic data samples of the new set that are not support vectors with the test correct set B_ok to obtain the incremental data sample set B_1′ of the attack classifier Γ_2;
repeat steps (1), (2) and (3).
Further, the forgetting factor α_i is obtained according to formulas (1), (2) and (3), where α_i denotes the ratio of the number of times the i-th network traffic data sample has been a support vector (SV) to the total number of training rounds T_i after T rounds of training, and r_i denotes the number of times the i-th network traffic data sample has been an SV after training. For each network traffic data sample in the test error set B_err, r_i = 0, i.e. the sample is normal network traffic data; for each sample in the test correct set B_ok, r_i = 1, i.e. the sample is attack network traffic data. It should be pointed out that, by the improved Karush-Kuhn-Tucker (KKT) theorem, formulas (1) and (2) are the conditions for the optimal hyperplane.
S204: obtain the error weight of each network traffic data sample with respect to the set threshold.
Specifically, the error weight of each network traffic data sample with respect to the set threshold is obtained according to formula (4), where e_i denotes the error weight and P denotes the set threshold:
e_i = P - α_i (1 ≤ i ≤ 10)    (4)
S205: select the forgetting factor with the largest error weight as the new threshold.
S206: adaptively adjust the values of β, γ and δ according to the forgetting factor.
S207: compare the forgetting factor with β, γ and δ.
S208: delete or retain the network traffic data samples according to the comparison result.
Specifically, the network traffic data samples are deleted or retained according to the comparison result by the following preset rules (an end-to-end sketch follows the list):
When the forgetting factor α_i = 0, the network traffic data sample corresponding to α_i is retained. The reason is that the newly added sample is an error sample; such samples do not affect the original attack classifier but may belong to a new class, possibly a new attack class, so they are retained as a subset of the test samples for the next round of training.
When 0 < α_i < β, the network traffic data sample corresponding to α_i is deleted. The reason is that the sample's SV ratio after repeated training is very low and it does not represent a new attack class, so it is deleted, which reduces the storage of the original network traffic data while improving the training speed of intrusion detection.
When β ≤ α_i < δ, the network traffic data samples with forgetting factor greater than γ are selected as the next test network traffic data sample set, which can accelerate the convergence of the SV search.
When δ < α_i < 1, the network traffic data sample corresponding to α_i is used as the next test network traffic data sample set, because it is an attack data sample.
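To tie S201-S208 together, here is a self-contained toy run on random stand-in "traffic" features. It compresses the earlier sketches into one loop: margin samples approximate SVs, the thresholds are the example values, and the rule applied at the end is the deletion rule for 0 < α_i < β. Everything here is illustrative, not the patent's reference implementation.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
# Toy stand-in for traffic features; label 1 = attack, 0 = normal (S201).
X, y = rng.normal(size=(100, 6)), rng.integers(0, 2, 100)
clf = SGDClassifier(loss="hinge", random_state=0)
clf.partial_fit(X, y, classes=np.array([0, 1]))

beta, gamma, delta, P = 0.3, 0.4, 0.7, 0.5        # S202: thresholds
sv_hits = np.zeros(len(X))                        # r_i counters

for t in range(10):                               # S203: incremental rounds
    Xb, yb = rng.normal(size=(20, 6)), rng.integers(0, 2, 20)
    clf.partial_fit(Xb, yb)
    X, y = np.vstack([X, Xb]), np.concatenate([y, yb])
    sv_hits = np.concatenate([sv_hits, np.zeros(len(Xb))])
    sv_hits += np.abs(clf.decision_function(X)) <= 1.0   # margin samples ~ SVs

alpha = sv_hits / 10                              # formula (3): forgetting factors
e = P - alpha                                     # S204: formula (4)
P = alpha[np.argmax(e)]                           # S205/S206: adapt the threshold
keep = ~((alpha > 0) & (alpha < beta))            # S207/S208: delete 0 < alpha < beta
X, y = X[keep], y[keep]
print(f"retained {len(X)} of {len(keep)} samples; new threshold P = {P:.2f}")
```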
It can be seen that this embodiment first builds an attack classifier and determines a classification feature vector, then trains the newly added network traffic data samples accordingly, and finally deletes or retains the network traffic data samples according to the relevant parameters obtained from training, so that useless network traffic data is actively deleted and useful network traffic data retained during incremental learning, improving the adaptability and accuracy of the cognitive computing model.
Fig. 3 is a schematic structural diagram of a classification device based on incremental learning according to an embodiment of the invention. As shown in the figure, the classification device based on incremental learning in this embodiment may include at least an initialization module 310, a data training module 320 and a data processing module 330, where:
The initialization module 310 is configured to build a classifier and determine a classification feature vector.
Specifically, the number of classifiers to be built is determined according to the amount of data samples to be processed; the corresponding number of classifiers is then built, and the classification feature vector of each classifier is determined. The training steps are as follows:
(1) classify the features of all data sample sets with the classifier;
(2) compute the correlation value of each feature vector in each classifier according to the preselected kernel function;
(3) compute the covariance matrix space from the obtained correlation values and perform a Householder transformation on it;
(4) compute the classification feature coefficients;
(5) obtain the model parameters of the classifier.
The data training module 320 is configured to train the newly added data samples according to the classifier and the classification feature vector.
Specifically, the steps of training the newly added data samples according to the classifier and the classification feature vector are as follows:
(1) randomly select a newly added data sample subset B_1 by the stochastic gradient descent (SGD) algorithm and train on it;
(2) judge, by the initial classifier Γ_1, whether the samples in B_1 are classified correctly, and divide B_1 into a test error set B_err and a test correct set B_ok according to the judgment result;
(3) judge whether the test error set B_err is empty: if so, extract a new batch of data samples by the SGD algorithm and train on it; if not, merge the set of support vectors (SVs) of the original data samples with the subset B_1 to obtain a new set and a new classifier Γ_2, and merge the data samples of the new set that are not support vectors with the test correct set B_ok to obtain the incremental data sample set B_1′ of the classifier Γ_2;
repeat steps (1), (2) and (3).
The data processing module 330 is configured to delete or retain the data samples according to the relevant parameters obtained from training. In a specific implementation, the data processing module 330 may further include, as shown in Fig. 4, a parameter calculation unit 331 and a data processing unit 332, where:
The parameter calculation unit 331 is configured to obtain the forgetting factor α_i according to formulas (1), (2) and (3), where α_i denotes the ratio of the number of times the i-th data sample has been a support vector (SV) to the total number of training rounds T_i after T rounds of training, and r_i denotes the number of times the i-th data sample has been an SV after training; r_i = 0 for each data sample in the test error set B_err, and r_i = 1 for each data sample in the test correct set B_ok.
The data processing unit 332 is configured to delete or retain the data samples according to a predictive incremental learning mechanism based on the forgetting factor α_i. In a specific implementation, three thresholds β, γ and δ are first set, for example β = 0.3, γ = 0.4, δ = 0.7; the forgetting factor α_i is then compared with β, γ and δ, and the data samples are deleted or retained according to the comparison result by the following preset rules:
When the forgetting factor α_i = 0, the data sample corresponding to α_i is retained. The reason is that the newly added data sample is an error sample; such samples do not affect the original classifier but may belong to a new class, so they are retained as a subset of the test samples for the next round of training.
When 0 < α_i < β, the data sample corresponding to α_i is deleted. The reason is that the sample's SV ratio after repeated training is very low and it does not represent a new class, so it is deleted, which reduces the storage of the original data while improving the training speed of intrusion detection.
When β ≤ α_i < δ, the data samples with forgetting factor greater than γ are selected as the next test data sample set, which can accelerate the convergence of the SV search.
When δ < α_i < 1, the data sample corresponding to α_i is used as the next test data sample set.
Referring to Fig. 4, the data processing module 330 may also include a threshold adjustment unit 333, configured to:
after every 10 rounds of training, obtain the error weight of each data sample with respect to the set threshold according to formula (4):
e_i = P - α_i (1 ≤ i ≤ 10)    (4)
select the forgetting factor α_i with the largest error weight as the new threshold;
adaptively adjust the values of β, γ and δ according to the forgetting factor α_i.
Fig. 5 is a schematic structural diagram of another classification device based on incremental learning according to an embodiment of the invention. As shown in Fig. 5, the classification device based on incremental learning may include: at least one processor 501, for example a CPU; at least one communication bus 502; at least one network interface 503; and a memory 504. The communication bus 502 is used to realize connection and communication between these components. The memory 504 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory; optionally, the memory 504 may also be at least one storage device located remotely from the processor 501. A set of program code is stored in the memory 504, and the processor 501 is configured to call the program code stored in the memory 504 to perform the following operations:
building a classifier and determining a classification feature vector;
training newly added data samples according to the classifier and the classification feature vector;
deleting or retaining the data samples according to the relevant parameters obtained from training.
Optionally, the specific operations by which the processor 501 builds the classifier and determines the classification feature vector are:
determining the number of classifiers to be built;
building each classifier;
determining the classification feature vector of each classifier.
Optionally again, the specific operations by which the processor 501 trains the newly added data samples according to the classifier and the classification feature vector are:
(1) randomly selecting a newly added data sample subset B_1 by the stochastic gradient descent (SGD) algorithm and training on it;
(2) judging, by the initial classifier Γ_1, whether the samples in B_1 are classified correctly, and dividing B_1 into a test error set B_err and a test correct set B_ok according to the judgment result;
(3) judging whether the test error set B_err is empty: if so, extracting a new batch of data samples by the SGD algorithm for training; if not, merging the set of support vectors (SVs) of the original data samples with the subset B_1 to obtain a new set and a new classifier Γ_2, and merging the data samples of the new set that are not support vectors with the test correct set B_ok to obtain the incremental data sample set B_1′ of the classifier Γ_2;
and repeating steps (1), (2) and (3).
Further, the specific operations by which the processor 501 deletes or retains the data samples according to the relevant parameters obtained from training are:
obtaining the forgetting factor α_i according to formulas (1), (2) and (3), where α_i denotes the ratio of the number of times the i-th data sample has been a support vector (SV) to the total number of training rounds T_i after T rounds of training, and r_i denotes the number of times the i-th data sample has been an SV after training; r_i = 0 for each data sample in the test error set B_err, and r_i = 1 for each data sample in the test correct set B_ok;
and deleting or retaining the data samples according to a predictive incremental learning mechanism based on the forgetting factor α_i.
Further, the specific operations by which the processor 501 deletes or retains the data samples according to the predictive incremental learning mechanism based on the forgetting factor α_i are:
setting three thresholds β, γ and δ;
comparing the forgetting factor α_i with β, γ and δ;
deleting or retaining the data samples according to the comparison result.
Optionally, after setting the three thresholds β, γ and δ, the processor 501 also performs:
after every 10 rounds of training, obtaining the error weight of each data sample with respect to the set threshold according to formula (4):
e_i = P - α_i (1 ≤ i ≤ 10)    (4)
selecting the forgetting factor α_i with the largest error weight as the new threshold;
adaptively adjusting the values of β, γ and δ according to the forgetting factor α_i.
Optionally again, the specific operations by which the processor 501 deletes or retains the data samples according to the comparison result are:
when the forgetting factor α_i = 0, retaining the data sample corresponding to α_i;
when 0 < α_i < β, deleting the data sample corresponding to α_i;
when β ≤ α_i < δ, selecting the data samples with forgetting factor greater than γ as the next test data sample set;
when δ < α_i < 1, using the data sample corresponding to α_i as the next test data sample set.
It can be seen that embodiments of the invention first build a classifier and determine a classification feature vector, then train newly added data samples accordingly, and finally delete or retain the data samples according to the relevant parameters obtained from training, so that useless data is actively deleted and useful data retained during incremental learning, improving the adaptability and accuracy of the cognitive computing model.
Those of ordinary skill in the art will appreciate that all or part of the flows of the methods in the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the flows of the embodiments of the methods above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure is only the preferred embodiments of the present invention and certainly cannot limit the scope of the rights of the present invention; equivalent variations made according to the claims of the present invention still fall within the scope covered by the present invention.

Claims (14)

1. A classification method based on incremental learning, characterized in that the method comprises:
building a classifier and determining a classification feature vector;
training newly added data samples according to the classifier and the classification feature vector;
deleting or retaining the data samples according to the relevant parameters obtained from training.
2. The method of claim 1, characterized in that building a classifier and determining a classification feature vector comprises:
determining the number of classifiers to be built;
building each classifier;
determining the classification feature vector of each classifier.
3. The method of claim 1, characterized in that training newly added data samples according to the classifier and the classification feature vector comprises:
(1) randomly selecting a newly added data sample subset B_1 by the stochastic gradient descent (SGD) algorithm and training on it;
(2) judging, by the initial classifier Γ_1, whether the samples in B_1 are classified correctly, and dividing B_1 into a test error set B_err and a test correct set B_ok according to the judgment result;
(3) judging whether the test error set B_err is empty: if so, extracting a new batch of data samples by the SGD algorithm for training; if not, merging the set of support vectors (SVs) of the original data samples with the subset B_1 to obtain a new set and a new classifier Γ_2, and merging the data samples of the new set that are not support vectors with the test correct set B_ok to obtain the incremental data sample set B_1′ of the classifier Γ_2;
and repeating steps (1), (2) and (3).
4. The method of claim 3, characterized in that deleting or retaining the data samples according to the relevant parameters obtained from training comprises:
obtaining the forgetting factor α_i according to formulas (1), (2) and (3), where α_i denotes the ratio of the number of times the i-th data sample has been a support vector (SV) to the total number of training rounds T_i after T rounds of training, r_i denotes the number of times the i-th data sample has been an SV after training, y_i is the class label of sample x_i, and H(·) is the kernel-induced feature mapping; r_i = 0 for each data sample in the test error set B_err, and r_i = 1 for each data sample in the test correct set B_ok:
ω = Σ_i α_i y_i H(x_i)    (1)
0 ≤ α_i ≤ 1, Σ_i α_i y_i = 0    (2)
α_i = r_i / T_i    (3)
and deleting or retaining the data samples according to a predictive incremental learning mechanism based on the forgetting factor α_i.
5. The method of claim 4, characterized in that deleting or retaining the data samples according to the predictive incremental learning mechanism based on the forgetting factor α_i comprises:
setting three thresholds β, γ and δ;
comparing the forgetting factor α_i with β, γ and δ;
deleting or retaining the data samples according to the comparison result.
6. The method of claim 5, characterized in that after setting the three thresholds β, γ and δ, the method further comprises:
after every 10 rounds of training, obtaining the error weight of each data sample with respect to the set threshold according to formula (4), where e_i denotes the error weight and P denotes the set threshold:
e_i = P - α_i (1 ≤ i ≤ 10)    (4)
selecting the forgetting factor α_i with the largest error weight as the new threshold;
adaptively adjusting the values of β, γ and δ according to the forgetting factor α_i.
7. The method of claim 5, characterized in that deleting or retaining the data samples according to the comparison result comprises:
when the forgetting factor α_i = 0, retaining the data sample corresponding to α_i;
when 0 < α_i < β, deleting the data sample corresponding to α_i;
when β ≤ α_i < δ, selecting the data samples with forgetting factor greater than γ as the next test data sample set;
when δ < α_i < 1, using the data sample corresponding to α_i as the next test data sample set.
8. A classification device based on incremental learning, characterized in that the device comprises:
an initialization module, configured to build a classifier and determine a classification feature vector;
a data training module, configured to train newly added data samples according to the classifier and the classification feature vector;
a data processing module, configured to delete or retain the data samples according to the relevant parameters obtained from training.
9. The device of claim 8, characterized in that the initialization module is specifically configured to:
determine the number of classifiers to be built;
build each classifier;
determine the classification feature vector of each classifier.
10. The device of claim 8, characterized in that the data training module is specifically configured to:
(1) randomly select a newly added data sample subset B_1 by the stochastic gradient descent (SGD) algorithm and train on it;
(2) judge, by the initial classifier Γ_1, whether the samples in B_1 are classified correctly, and divide B_1 into a test error set B_err and a test correct set B_ok according to the judgment result;
(3) judge whether the test error set B_err is empty: if so, extract a new batch of data samples by the SGD algorithm and train on it; if not, merge the set of support vectors (SVs) of the original data samples with the subset B_1 to obtain a new set and a new classifier Γ_2, and merge the data samples of the new set that are not support vectors with the test correct set B_ok to obtain the incremental data sample set B_1′ of the classifier Γ_2;
and repeat steps (1), (2) and (3).
11. The device of claim 10, characterized in that the data processing module comprises:
a parameter calculation unit, configured to obtain the forgetting factor α_i according to formulas (1), (2) and (3), where α_i denotes the ratio of the number of times the i-th data sample has been a support vector (SV) to the total number of training rounds T_i after T rounds of training, r_i denotes the number of times the i-th data sample has been an SV after training, y_i is the class label of sample x_i, and H(·) is the kernel-induced feature mapping; r_i = 0 for each data sample in the test error set B_err, and r_i = 1 for each data sample in the test correct set B_ok:
ω = Σ_i α_i y_i H(x_i)    (1)
0 ≤ α_i ≤ 1, Σ_i α_i y_i = 0    (2)
α_i = r_i / T_i    (3)
a data processing unit, configured to delete or retain the data samples according to a predictive incremental learning mechanism based on the forgetting factor α_i.
12. The device of claim 11, characterized in that the data processing unit is specifically configured to:
set three thresholds β, γ and δ;
compare the forgetting factor α_i with β, γ and δ;
delete or retain the data samples according to the comparison result.
13. The device of claim 12, characterized in that the data processing module further comprises a threshold adjustment unit, configured to:
after every 10 rounds of training, obtain the error weight of each data sample with respect to the set threshold according to formula (4), where e_i denotes the error weight and P denotes the set threshold:
e_i = P - α_i (1 ≤ i ≤ 10)    (4)
select the forgetting factor α_i with the largest error weight as the new threshold;
adaptively adjust the values of β, γ and δ according to the forgetting factor α_i.
14. The device of claim 12, characterized in that the data processing unit is further specifically configured to:
when the forgetting factor α_i = 0, retain the data sample corresponding to α_i;
when 0 < α_i < β, delete the data sample corresponding to α_i;
when β ≤ α_i < δ, select the data samples with forgetting factor greater than γ as the next test data sample set;
when δ < α_i < 1, use the data sample corresponding to α_i as the next test data sample set.
CN201510824421.0A 2015-11-24 2015-11-24 Classification method and device based on incremental learning Pending CN106778795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510824421.0A CN106778795A (en) 2015-11-24 2015-11-24 Classification method and device based on incremental learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510824421.0A CN106778795A (en) 2015-11-24 2015-11-24 Classification method and device based on incremental learning

Publications (1)

Publication Number Publication Date
CN106778795A 2017-05-31

Family

ID=58964157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510824421.0A Pending CN106778795A (en) 2015-11-24 2015-11-24 Classification method and device based on incremental learning

Country Status (1)

Country Link
CN (1) CN106778795A (en)


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107508866A (en) * 2017-08-08 2017-12-22 重庆大学 Reduce the method for the transmission consumption of mobile device end neural network model renewal
CN107508866B (en) * 2017-08-08 2020-10-02 重庆大学 Method for reducing transmission consumption of mobile equipment end neural network model updating
CN109784044A (en) * 2017-11-10 2019-05-21 北京安码科技有限公司 A kind of Android malware recognition methods of the improvement SVM based on incremental learning
CN108347430B (en) * 2018-01-05 2021-01-12 国网山东省电力公司济宁供电公司 Network intrusion detection and vulnerability scanning method and device based on deep learning
CN108347430A (en) * 2018-01-05 2018-07-31 国网山东省电力公司济宁供电公司 Network invasion monitoring based on deep learning and vulnerability scanning method and device
CN108537227A (en) * 2018-03-21 2018-09-14 华中科技大学 A kind of offline false distinguishing method of commodity based on width study and wide-angle micro-image
CN110011932A (en) * 2019-04-18 2019-07-12 清华大学深圳研究生院 A kind of the net flow assorted method and terminal device of recognizable unknown flow rate
CN110011932B (en) * 2019-04-18 2022-04-05 清华大学深圳研究生院 Network traffic classification method capable of identifying unknown traffic and terminal equipment
CN110070060A (en) * 2019-04-26 2019-07-30 天津开发区精诺瀚海数据科技有限公司 A kind of method for diagnosing faults of bearing apparatus
CN111092894A (en) * 2019-12-23 2020-05-01 厦门服云信息科技有限公司 Webshell detection method based on incremental learning, terminal device and storage medium
CN111832839A (en) * 2020-07-24 2020-10-27 河北工业大学 Energy consumption prediction method based on sufficient incremental learning
CN115774854A (en) * 2023-01-30 2023-03-10 北京亿赛通科技发展有限责任公司 Text classification method and device, electronic equipment and storage medium
CN115952934A (en) * 2023-03-15 2023-04-11 华东交通大学 Traffic flow prediction method and system based on incremental output decomposition recurrent neural network
CN115952934B (en) * 2023-03-15 2023-06-16 华东交通大学 Traffic flow prediction method and system based on incremental output decomposition cyclic neural network

Similar Documents

Publication Publication Date Title
CN106778795A (en) Classification method and device based on incremental learning
CN109891508B (en) Single cell type detection method, device, apparatus and storage medium
CN110020592A (en) Object detection model training method, device, computer equipment and storage medium
CN106201871A (en) Based on the Software Defects Predict Methods that cost-sensitive is semi-supervised
WO2018072580A1 (en) Method for detecting illegal transaction and apparatus
CN106897792A (en) A kind of structural fire protection risk class Forecasting Methodology and system
CN107545038A (en) A kind of file classification method and equipment
CN109636212B (en) Method for predicting actual running time of job
CN108900622A (en) Data fusion method, device and computer readable storage medium based on Internet of Things
CN111191836A (en) Well leakage prediction method, device and equipment
CN112596964A (en) Disk failure prediction method and device
CN111753461A (en) Tidal water level correction method, target residual water level acquisition method, device and equipment
CN106855844A (en) A kind of performance test methods and system
CN116994077A (en) Regression prediction method for flight attitude under action of complex wind field
KR20220049573A (en) Distance-based learning trust model
CN111461329A (en) Model training method, device, equipment and readable storage medium
CN108021774B (en) Data processing method and device
CN113128598B (en) Sensing data detection method, device, equipment and readable storage medium
CN115618928A (en) Slope displacement prediction method and device and electronic equipment
CN116415836A (en) Security evaluation method for intelligent power grid information system
CN115392582A (en) Crop yield prediction method based on incremental fuzzy rough set attribute reduction
CN115422821A (en) Data processing method and device for rock mass parameter prediction
CN108021900A (en) Space of a whole page subfield method and device
CN111882135B (en) Internet of things equipment intrusion detection method and related device
CN107067036A (en) A kind of ground net corrosion rate prediction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170531