CN107133293A - An improved ML-kNN method and system for multi-label classification - Google Patents


Info

Publication number: CN107133293A
Application number: CN201710278015.8A
Authority: CN (China)
Legal status: Pending
Inventors: 刘鹏鹤, 孙晓平, 孙毓忠
Assignee: Institute of Computing Technology of CAS
Priority: CN201710278015.8A

Classifications

    • G06F16/355 Class or cluster creation or modification (information retrieval; clustering or classification of unstructured textual data)
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting (pattern recognition; design or setup of recognition systems)
    • G06F18/24155 Bayesian classification (pattern recognition; classification based on probabilistic models)


Abstract

The present invention relates to an improved ML-kNN method and system for multi-label classification, comprising: counting, for each label class in a raw data set, the total number of samples carrying that label as the label sample count; counting, within the samples of each label class, the total number of samples per feature value as the feature sample count; and computing a feature-label weight from the label sample count and the feature sample count, each feature corresponding to one feature value. Each sample in the raw data set is split into several original single-label samples, each with a single label, and the feature values of each original single-label sample are updated with the feature-label weights to generate a first data set. A test sample to be predicted is obtained and split into single-label test samples with single labels; the label of each single-label test sample is predicted in turn against the first data set, and the label set of the test sample is determined. The invention thereby makes the prediction results of multi-label classification more accurate.

Description

An improved ML-kNN method and system for multi-label classification
Technical field
The present invention relates to the field of machine learning, and in particular to an improved ML-kNN method and system for multi-label classification.
Background technology
In traditional single-label classification, learning proceeds from a series of samples each carrying one label l drawn from a label set L, with |L| > 1. If |L| = 2, the learning problem is called binary classification; if |L| > 2, it is multi-class classification. In multi-label classification, however, a sample often carries several labels Y, with Y ⊆ L. Many labeling problems in reality are of this kind: in text classification, a text may belong both to the sports category and to the politics category; in medical diagnosis, a patient often has several complications, for example respiratory tract infection, bronchitis, and pneumonia at the same time. The paper (Tsoumakas G, Katakis I. Multi-Label Classification: An Overview [J]. International Journal of Data Warehousing & Mining, 2010, 3(3): 1-13) divides methods for multi-label classification into two classes: problem transformation methods, so called because they convert the multi-label classification problem into one or more single-label classification problems, and algorithm adaptation methods, so called because they attempt to extend a learning algorithm to suit multi-label data sets. The most common problem transformation method (Boutell M R, Luo J, Shen X, et al. Learning multi-label scene classification [J]. Pattern Recognition, 2004, 37(9): 1757-1771) splits the raw data set into |L| sub-data sets D_l; a sample in each sub-data set is marked l if it carries label l and ¬l otherwise. The |L| sub-data sets are then used to train |L| binary classifiers; to classify a sample, each of the |L| binary classifiers makes a prediction, and the union of their predictions is taken as the sample's most plausible predicted label set. Among algorithm adaptation methods, the paper (Clare A, King R D. Knowledge Discovery in Multi-label Phenotype Data [J]. Lecture Notes in Computer Science, 2001, 2168: 42-53) adapts the C4.5 algorithm to multi-label data, modifying the entropy formula so that a leaf node of the decision tree may carry several labels. The paper (Schapire R E, Singer Y. BoosTexter: A Boosting-based System for Text Categorization [J]. Machine Learning, 2000, 39(2): 135-168) extends the AdaBoost algorithm (Freund, Yoav, Schapire, Robert E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting [J]. Journal of Computer & System Sciences, 1997, 55(1): 119-139) into AdaBoost.MH and AdaBoost.MR to suit multi-label classification. In AdaBoost.MH, when predicting a sample, a label l under consideration is added to the sample's predicted label set if the weak classifier's output is positive and omitted otherwise; in AdaBoost.MR, the weak classifier's output is instead used to rank the label set L to determine the final output. The paper (Godbole S, Sarawagi S. Discriminative Methods for Multi-labeled Classification [M]// Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2004: 22-30) makes some improvements to support vector machines (SVM) to support multi-label data: it first removes the negative training samples whose distance to the separating hyperplane is below a certain threshold, and then further removes the negative training samples that validation-set testing shows to be close to positive samples. The paper (Thabtah F A, Cowling P, Peng Y. MMAC: A New Multi-Class, Multi-Label Associative Classification Approach [C]// IEEE International Conference on Data Mining. IEEE, 2004: 217-224) proposes the MMAC algorithm for the multi-label problem, which handles classification-rule construction by association-rule mining: it first learns an initial set of classification rules with an association-rule mining algorithm, then deletes the samples covered by those rules and continues learning classification rules from the remaining samples, repeating until no new rule appears; rules sharing the same antecedent but carrying different labels are merged into one multi-label rule. ML-kNN (Zhang M L, Zhou Z H. A k-nearest neighbor based algorithm for multi-label classification [C]// IEEE International Conference on Granular Computing. IEEE, 2005: 718-721 Vol. 2) is an evolution of the lazy kNN learning algorithm to suit multi-label data sets. ML-kNN (multi-label k-nearest neighbors) applies the kNN algorithm independently to each label l: for a test sample, it finds the k nearest neighbors in the training set and treats the neighbors carrying label l as positive samples and the rest as negative samples; from the statistics of these neighbors' label sets it then uses the maximum a posteriori (MAP) principle to decide the test sample's label set, the prior and posterior probabilities for each label being based on the kNN counts. ML-kNN still has some shortcomings. First, owing to the multi-label nature of the samples, it does not distinguish the feature vectors corresponding to a sample's different labels: for one sample with several different labels, ML-kNN considers those labels to share an identical feature vector, which lowers the discrimination between labels and increases the classification error. Second, in computing distances between samples, ML-kNN uses classical cosine similarity as the distance measure, which does not consider the correlation between labels; for example, in a medical diagnosis data set the two disease labels "bronchopneumonia" and "bronchitis" are strongly correlated, and this correlation has a definite influence on the distance computation, a point ML-kNN does not take into account.
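For reference, the baseline ML-kNN procedure described above can be sketched as follows. This is a minimal illustration under assumed conventions (Euclidean distance between samples, smoothing constant s, hypothetical function names), not the patent's improved method:

```python
import numpy as np

def ml_knn_train(X, Y, k=3, s=1.0):
    """Estimate ML-kNN statistics. X: (n, d) features; Y: (n, L) 0/1 labels.
    Returns per-label priors and the neighbor-count posteriors used by MAP."""
    n, L = Y.shape
    prior1 = (s + Y.sum(axis=0)) / (2 * s + n)         # P(H1) for each label
    c1 = np.zeros((L, k + 1))                          # label present, c neighbors carry it
    c0 = np.zeros((L, k + 1))                          # label absent,  c neighbors carry it
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                  # exclude the sample itself
        cnt = Y[np.argsort(d)[:k]].sum(axis=0).astype(int)
        for l in range(L):
            (c1 if Y[i, l] else c0)[l, cnt[l]] += 1
    post1 = (s + c1) / (s * (k + 1) + c1.sum(axis=1, keepdims=True))
    post0 = (s + c0) / (s * (k + 1) + c0.sum(axis=1, keepdims=True))
    return prior1, post1, post0

def ml_knn_predict(x, X, Y, prior1, post1, post0, k=3):
    """MAP decision per label from the label counts of x's k nearest neighbors."""
    cnt = Y[np.argsort(np.linalg.norm(X - x, axis=1))[:k]].sum(axis=0).astype(int)
    return [l for l in range(Y.shape[1])
            if prior1[l] * post1[l, cnt[l]] > (1 - prior1[l]) * post0[l, cnt[l]]]
```

Note that every label is scored against the same feature vector of the test sample, which is exactly the limitation the present invention addresses by splitting samples per label.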
An existing invention patent, "A multi-label classification method based on particle swarm optimization" (CN104991974A), also uses the kNN algorithm, but it optimizes feature weights with a particle swarm algorithm, whereas the present invention updates the data set on the basis of feature prior probabilities; in addition, the present invention also improves the distance calculation formula.
An existing invention patent, "A multi-label classification method and device" (CN104899596A), uses a transformation algorithm to convert the multi-label problem into single-label classification problems, but the present invention differs from it clearly in several novel points it proposes, including the feature-weight calculation and the sample splitting performed both when updating the training data set and when predicting a sample, none of which that patent has.
Summary of the invention
In view of the above shortcomings of ML-kNN, the present invention proposes an improved ML-kNN method for multi-label classification, comprising the following steps:
Step 1: obtain a raw data set comprising a plurality of samples, each sample having several labels and several classes of features; count in the raw data set the number of samples per label class as the label sample count; count, within the samples of each label class, the number of samples per feature value as the feature sample count; and compute a feature-label weight from the label sample count and the feature sample count, each feature corresponding to one feature value.
Step 2: split each sample in the raw data set into several original single-label samples, each with a single label, and update the feature values of each original single-label sample with the feature-label weights to generate a first data set.
Step 3: obtain a test sample to be predicted, split the test sample into single-label test samples each with a single label, predict the label of each single-label test sample in turn against the first data set, and determine the label set of the test sample.
In this improved ML-kNN method for multi-label classification, step 3 comprises:
Step 31: count the number C of label classes involved in the raw data set, and split the unknown multi-label sample into C single-label test samples;
Step 32: update the feature values of the C single-label test samples with the feature-label weights to generate a second data set;
Step 33: by computing the distance between each single-label test sample in the second data set and each original single-label sample in the first data set, predict one label for each single-label test sample in turn, all the predicted labels forming the prediction label set;
Step 34: determine the label set of the test sample from the number of times each label class occurs in the prediction label set.
In this improved ML-kNN method for multi-label classification, the prediction method of step 33 is specifically:
Step 331: compute the distance between the single-label test sample and each original single-label sample in the first data set;
Step 332: take out, according to that distance, the k original single-label samples nearest to the single-label test sample, and take the label class occurring most often among those k original single-label samples as the predicted label of the single-label test sample.
In this improved ML-kNN method for multi-label classification, step 34 further comprises: compute the posterior probability of each predicted label's occurrence; if the posterior probability is greater than or equal to a preset threshold, add the predicted label to the result label set, and take the result label set as the label set of the test sample.
In this improved ML-kNN method for multi-label classification, the feature values in step 1 take values in {0, 1}.
The invention also provides an improved ML-kNN system for multi-label classification, comprising the following modules:
A feature-label weight computation module: obtains a raw data set comprising a plurality of samples, each sample having several labels and several classes of features; counts in the raw data set the number of samples per label class as the label sample count; counts, within the samples of each label class, the number of samples per feature value as the feature sample count; and computes a feature-label weight from the label sample count and the feature sample count, each feature corresponding to one feature value.
A sample splitting module: splits each sample in the raw data set into several original single-label samples, each with a single label, and updates the feature values of each original single-label sample with the feature-label weights to generate a first data set.
A test-sample prediction module: obtains a test sample to be predicted, splits the test sample into single-label test samples each with a single label, predicts the label of each single-label test sample in turn against the first data set, and determines the label set of the test sample.
In this improved ML-kNN system for multi-label classification, the test-sample prediction module comprises:
A statistics module: counts the number C of label classes involved in the raw data set and splits the unknown multi-label sample into C single-label test samples.
An update module: updates the feature values of the C single-label test samples with the feature-label weights to generate a second data set.
A label prediction module: by computing the distance between each single-label test sample in the second data set and each original single-label sample in the first data set, predicts one label for each single-label test sample in turn, all the predicted labels forming the prediction label set.
A label set module: determines the label set of the test sample from the number of times each label class occurs in the prediction label set.
In this improved ML-kNN system for multi-label classification, the label prediction module comprises:
A distance computation module: computes the distance between the single-label test sample and each original single-label sample in the first data set.
A screening module: takes out, according to that distance, the k original single-label samples nearest to the single-label test sample, and takes the label class occurring most often among those k original single-label samples as the predicted label of the single-label test sample.
In this improved ML-kNN system for multi-label classification, the label set module further: computes the posterior probability of each predicted label's occurrence; if the posterior probability is greater than or equal to a preset threshold, adds the predicted label to the result label set and takes the result label set as the label set of the test sample.
In this improved ML-kNN system for multi-label classification, the feature values in the feature-label weight computation module take values in {0, 1}.
The present invention modifies and extends the ML-kNN algorithm in three parts: feature-label weight calculation, training set updating, and sample prediction. Concretely, the method first takes full account of the influence of different features on different labels by computing feature-label weights; it then splits each multi-label sample in the training set into several single-label samples and updates the feature values of each single-label sample with the feature-label weights, so as to distinguish the feature vectors corresponding to different labels; finally, in predicting an unknown sample, the method likewise splits and updates the unknown sample and revises the sample distance formula to reflect the correlation between labels, so that the method achieves better results in multi-label classification.
Brief description of the drawings
Fig. 1 is a schematic diagram of the training data update of the present invention;
Fig. 2 is a schematic diagram of the sample prediction of the present invention.
Embodiments
The present invention provides an improved ML-kNN method for multi-label classification, whose goal is to improve the performance of ML-kNN in multi-label classification.
To achieve the above goal, the technical scheme adopted by the present invention is as follows:
Step 1: obtain a raw data set comprising a plurality of samples and involving C label classes in total, each sample having several labels and several classes of features; count in the raw data set the number of samples per label class as the label sample count; count, within the samples of each label class, the number of samples per feature value as the feature sample count; and compute a feature-label weight from the label sample count and the feature sample count, each feature corresponding to one feature value. The feature-label weight is computed as follows:
Step 11: given a raw data set T, T = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, where x_i = (x_i^(1), x_i^(2), …, x_i^(m))^T, x_i^(j) is the j-th feature of the i-th sample, and x_i^(j) ∈ {a_j1, a_j2, …, a_jS_j}, a_jk being the k-th value the j-th feature may take, with j = 1, 2, …, m and k = 1, 2, …, S_j. Each label set y_i ⊆ {l_1, l_2, …, l_C}, where {l_1, l_2, …, l_C} denotes the set of all labels appearing in the raw data set, C is the number of distinct label classes, and l_c is the c-th label.
Step 12: compute the feature-label weight according to the following formula, where λ > 0 is a smoothing factor preventing a weight from being 0:

P(X^(j) = a_jk | Y = l_c) = ( Σ_{i=1..N} I(x_i^(j) = a_jk, l_c ∈ y_i) + λ ) / ( Σ_{i=1..N} I(l_c ∈ y_i) + S_j λ )

where j = 1, 2, …, m; k = 1, 2, …, S_j; c = 1, 2, …, C; Y denotes a sample label; I(·) is the 0-1 indicator function, equal to 1 if the condition in the brackets holds and 0 otherwise; and P(X^(j) = a_jk | Y = l_c) is the conditional probability that feature X^(j) takes the value a_jk given class label l_c, i.e., the feature-label weight.
Step 2: split each sample in the raw data set into several original single-label samples, each with a single label, and update the feature values of each original single-label sample with the feature-label weights to generate the first data set. The raw data set is updated as shown in Fig. 1, and the implementation is as follows:
Step 21: receive the raw data set T and each feature-label weight P(X^(j) = a_jk | Y = l_c);
Step 22: split each multi-label sample x_i = (x_i^(1), x_i^(2), …, x_i^(m))^T with label set y_i into as many single-label samples as it has labels:

(x_i, y_i) → { (x_i, y_ic = l_c) : l_c ∈ y_i }
Step 23: for each single-label sample (x_i, y_ic = l_c) after splitting, update its feature values with the feature-label weights according to the class label it carries, generating the first data set; each feature value is updated as

(x_i^(j))' = ω_1 a_jk + ω_2 P(X^(j) = a_jk | Y = l_c)

where ω_1 and ω_2 are coefficients determined by experiment.
Step 3: sample prediction. Given a test sample with an unknown multi-label set, predict its label set: obtain the test sample, split it into single-label test samples each with a single label, predict the label of each single-label test sample in turn against the first data set, and determine the label set of the test sample. The implementation is as follows:
Step 31: count the number C of label classes involved in the raw data set, and split the unknown multi-label sample into C single-label test samples;
Step 32: as in step 23, update the feature values of the C single-label test samples according to the formula in step 23, generating a second data set;
Step 33: by computing the distance between each single-label test sample in the second data set and each original single-label sample in the first data set, predict one label for each single-label test sample in turn, generating the prediction label set. The prediction method comprises:
Step 331: compute the distance between a single-label test sample and each sample in the first data set according to the following formula:

dist(x_i, x_j) = σ^(−sim(x_i, x_j)), where sim(x_i, x_j) = (x_i · x_j) / (‖x_i‖ ‖x_j‖)

dist(x_i, x_j) denotes the distance between samples x_i and x_j and is a monotonically decreasing function of sim(x_i, x_j). sim(x_i, x_j) measures the similarity of the two samples: its numerator is the inner product of the two sample vectors, its denominator the product of their lengths, its range [−1, +1], and the larger its value the more similar the samples. σ is a constant greater than 1 used to adjust the variation trend of the function. Because dist(x_i, x_j) decreases monotonically in sim(x_i, x_j), a larger sim means a smaller dist; that is, the more similar two samples are the smaller their distance, and the less similar the larger. The formula also reflects the correlation between labels to a certain extent: by the properties of the exponential function, dist(x_i, x_j) varies faster for sim(x_i, x_j) in [−1, 0] than in [0, +1], so the more similar two samples are, the smaller the variation of their distance, and the more likely their labels are to be related. The distance formula of the prior art expresses distance simply with the function sim(x, y); the improvement of the present invention is to project it into the exponential-function space so as to better reflect the correlation between sample labels.
Step 332: take out the k original single-label samples nearest to the single-label test sample, and take the label class occurring most often among those k original single-label samples as the predicted label of the single-label test sample.
Step 34: determine the label set of the test sample from the number of times each label class occurs in the prediction label set, as shown in Fig. 2; the method comprises:
Step 341: count the number of times each label class occurs in the prediction label set, and compute the posterior probability of each predicted label according to the following formula:

P(l_c | T) = (1/C) Σ_{v=1..C} I(h(x_iv) = l_c)

where T is the raw data set of step 11 and h(x_iv) is the predicted label of the v-th of the C single-label test samples split from the test sample x_i;
Step 342: judge whether the posterior probability P(l_c | T) of each label is greater than or equal to the preset threshold; if so, add the predicted label to the result label set; otherwise discard it.
Step 343: output the result label set generated in step 342 as the label set of the test sample, i.e., the prediction result for the test sample.
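A minimal sketch of this thresholding step follows. Since the patent's posterior formula is not reproduced here, the empirical frequency of each label among the C per-split predictions is used as an assumed stand-in for the posterior, and the function name is illustrative:

```python
from collections import Counter

def result_label_set(predictions, threshold=0.5):
    """predictions: the C labels predicted for the C single-label test samples
    split from one test sample. A label enters the result set when its share
    of the predictions (standing in for the posterior) meets the threshold."""
    total = len(predictions)
    return {label for label, n in Counter(predictions).items()
            if n / total >= threshold}

# e.g. 3 of 4 split samples predicted "bronchitis", 1 predicted "pneumonia":
# result_label_set(["bronchitis"] * 3 + ["pneumonia"], 0.5) -> {"bronchitis"}
```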
The implementation steps of the improved ML-kNN method for multi-label classification proposed by the present invention are described in further detail below with reference to the drawings:
Step 1: compute the feature-label weight; the implementation comprises:
Step 11: given a raw data set T, T = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, where x_i = (x_i^(1), x_i^(2), …, x_i^(m))^T, x_i^(j) is the j-th feature of the i-th sample, and x_i^(j) ∈ {a_j1, a_j2, …, a_jS_j}, a_jk being the k-th value the j-th feature may take, with j = 1, 2, …, m and k = 1, 2, …, S_j. Each label set y_i ⊆ {l_1, l_2, …, l_C}, where {l_1, l_2, …, l_C} denotes the set of all labels appearing in the raw data set, C is the number of distinct label classes, and l_c is the c-th label.
For example, a medical diagnosis data set has 1990 samples; each sample has a series of features such as cough, fever, and vomiting, each taking values in {0, 1}, and each sample carries several of the disease labels of the data set; the total number of disease label classes is 72, including bronchopneumonia, respiratory tract infection, and so on.
Step 12: compute the feature-label weight according to the following formula, where λ is a smoothing factor preventing a weight from being 0. In the specific implementation, Laplace smoothing is used and λ is taken as 1, which amounts to counting each value of each random variable once more than observed. For example, suppose that in the actual sample data none of the samples carrying the "respiratory tract infection" label has the "vomiting" feature; without λ, the feature-label weight of feature "vomiting" taking value 1 given label "respiratory tract infection" would be 0, which would affect subsequent computation. Since the actual sample is after all finite, taking λ = 1 performs a Bayesian estimate, as if the feature "vomiting" took value 1 at least once among the samples with "respiratory tract infection".

P(X^(j) = a_jk | Y = l_c) = ( Σ_{i=1..N} I(x_i^(j) = a_jk, l_c ∈ y_i) + λ ) / ( Σ_{i=1..N} I(l_c ∈ y_i) + S_j λ )

where j = 1, 2, …, m; k = 1, 2, …, S_j; c = 1, 2, …, C; λ > 0; Y denotes a label the sample carries; and I(·) is the 0-1 indicator function, equal to 1 if the condition in the brackets holds and 0 otherwise. P(X^(j) = a_jk | Y = l_c) is the conditional probability that feature X^(j) takes the value a_jk given class label l_c, that is, the feature-label weight described above.
For example, on the medical diagnosis data set, suppose the label sample count for the bronchopneumonia label is 100, and among these 100 bronchopneumonia-labelled samples the feature sample count for the cough feature is 80, the cough feature taking values in {0, 1}. Then according to the above formula, the feature-label weight of cough = 1 given the bronchopneumonia label is (80 + 1) / (100 + 2 × 1) = 81/102 ≈ 0.7941, and the feature-label weight of cough = 0 given the bronchopneumonia label is (20 + 1) / (100 + 2 × 1) = 21/102 ≈ 0.2059.
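The arithmetic of this example can be checked with a short sketch of the smoothed weight formula (the function name and argument names are illustrative, not from the patent):

```python
def feature_label_weight(feature_samples, label_samples, n_values, lam=1.0):
    """Laplace-smoothed weight P(X^(j) = a_jk | Y = l_c):
    feature_samples -- samples with label l_c whose feature takes the value a_jk
    label_samples   -- total samples carrying label l_c
    n_values        -- S_j, the number of values the feature can take
    lam             -- smoothing factor lambda (1 gives Laplace smoothing)"""
    return (feature_samples + lam) / (label_samples + n_values * lam)

# bronchopneumonia example: 80 of 100 labelled samples cough, cough is binary
print(round(feature_label_weight(80, 100, 2), 4))   # 0.7941  (= 81/102)
print(round(feature_label_weight(20, 100, 2), 4))   # 0.2059  (= 21/102)
```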
Step 2: update the raw data set, as shown in Fig. 1; the implementation is as follows:
Step 21: given the raw data set T and each feature-label weight P(X^(j) = a_jk | Y = l_c);
Step 22: split each multi-label sample x_i = (x_i^(1), x_i^(2), …, x_i^(m))^T with label set y_i into as many single-label samples as it has labels; that is, a multi-label sample with α labels is split into α single-label samples:

(x_i, y_i) → { (x_i, y_ic = l_c) : l_c ∈ y_i }
Step 23: for each single-label sample (x_i, y_ic = l_c) after splitting, update its feature values with the feature-label weights according to the class label it carries, generating the first data set composed of single-label samples; each feature value is updated as

(x_i^(j))' = ω_1 a_jk + ω_2 P(X^(j) = a_jk | Y = l_c)

where ω_1 and ω_2 are coefficients determined by experiment.
Suppose here that a raw data set sample has features {cough, no fever} and label set {bronchitis, bronchopneumonia}, so that its formalized feature vector is x_i = (1, 0). The feature-label weights computed in step 1 are as follows:
              Bronchitis    Bronchopneumonia
Cough         0.8922        0.7941
No cough      0.1078        0.2059
Fever         0.3039        0.8922
No fever      0.7961        0.1078
Taking ω_1 = 0 and ω_2 = 1, the multi-label training sample {x_i = (1, 0), y = {bronchitis, bronchopneumonia}} is split into the following two single-label training samples:
{x_i = (0.8922, 0.7961), y = bronchitis},
{x_i = (0.7941, 0.1078), y = bronchopneumonia}.
Updating the 1990 multi-label training samples of the medical diagnosis data set of step 11 in this way yields a larger number of single-label samples; for the purposes of the following explanation, assume that the split produces 5000 single-label samples.
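Steps 22-23 on this worked example can be sketched as follows; the weight lookup table W simply restates the table above (feature index 0 = cough, 1 = fever, values in {0, 1}), and the helper name is illustrative:

```python
def split_and_update(features, labels, weight, omega1=0.0, omega2=1.0):
    """Split a multi-label sample into one single-label sample per label and
    replace each feature value a_jk by omega1*a_jk + omega2*P(X = a_jk | Y = l)."""
    singles = []
    for l in labels:
        updated = tuple(omega1 * a + omega2 * weight[(l, j, a)]
                        for j, a in enumerate(features))
        singles.append((updated, l))
    return singles

# weight table from the example: (label, feature index, feature value) -> weight
W = {("bronchitis", 0, 1): 0.8922, ("bronchitis", 0, 0): 0.1078,
     ("bronchitis", 1, 1): 0.3039, ("bronchitis", 1, 0): 0.7961,
     ("bronchopneumonia", 0, 1): 0.7941, ("bronchopneumonia", 0, 0): 0.2059,
     ("bronchopneumonia", 1, 1): 0.8922, ("bronchopneumonia", 1, 0): 0.1078}

samples = split_and_update((1, 0), ["bronchitis", "bronchopneumonia"], W)
# -> [((0.8922, 0.7961), 'bronchitis'), ((0.7941, 0.1078), 'bronchopneumonia')]
```

With ω_1 = 0 and ω_2 = 1 the updated vector is just the weight column of the sample's label, which is how the same raw vector (1, 0) comes to carry a different feature vector per label.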
Step 3: sample prediction. An unknown test sample containing multiple labels is predicted, giving its possible label set, as shown in Fig. 2; the implementation is as follows:
Step 31: According to the total number of label classes C in the raw data set, split the unknown multi-label sample into C single-label samples to be tested. For instance, given an unknown medical diagnosis sample and the medical diagnosis data set mentioned in Step 11, the unknown sample is first split into 72 unknown single-label samples to be tested.
Step 32: As in Step 23, update the feature values of the single-label samples to be tested according to the formula in Step 23, i.e. perform the feature update on the 72 single-label samples to be tested from Step 31.
Step 33: Using the first data set, predict one label for each single-label sample to be tested in turn, generating a predicted label set containing C labels. On the medical diagnosis data set, each single-label sample to be tested is predicted separately, as follows:
Step 331: Compute, according to the formula below, the distance between the single-label sample to be predicted and each sample in the first data set; on the medical diagnosis data set, this means computing the distance between every single-label sample to be tested from Step 32 and each of the 5000 original single-label samples from Step 23:

dist(xi, xj) = e^(-sim(xi, xj)),  where sim(xi, xj) = (xi · xj) / (|xi| · |xj|)
Here dist(xi, xj) denotes the distance between a single-label sample to be tested xi and an original single-label sample xj; it is a monotonically decreasing function of sim(xi, xj). The similarity sim(xi, xj) measures how alike the two samples are: its numerator is the inner product of the two sample vectors, its denominator is the product of their lengths, its range is [-1, +1], and a larger value indicates more similar samples. Because dist(xi, xj) decreases monotonically in sim(xi, xj), a larger sim(xi, xj) gives a smaller dist(xi, xj): the more similar two samples are, the closer they are, and the less similar, the farther apart. The distance formula also reflects, to some extent, the correlation between labels: by the characteristics of the exponential function, dist(xi, xj) varies more over sim values in [-1, 0] than over [0, +1], so for more similar samples the distance varies less, suggesting their labels may also be more correlated.
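The description above (inner product over the product of vector lengths, and a monotonically decreasing exponential of that similarity) is consistent with dist = e^(-sim); the following is a minimal Python sketch under that assumption, with hypothetical function names:

```python
import math

def cosine_sim(x, y):
    """Inner product of the two vectors over the product of their
    lengths; range [-1, +1], larger means more similar."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def dist(x, y):
    """Monotonically decreasing exponential of the similarity:
    more similar samples get a smaller distance."""
    return math.exp(-cosine_sim(x, y))
```

As a sanity check, identical vectors give the smallest distance (e^-1), orthogonal vectors a larger one (e^0 = 1), and opposite vectors the largest (e^1).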
Step 332: Take out the k original single-label samples nearest to the single-label sample being tested, count the labels these k samples carry, and take the label class occurring most often among them as the prediction for the single-label sample being tested. On the medical diagnosis data set, suppose k = 5 and the labels of the 5 nearest single-label training samples are [bronchopneumonia, bronchitis, bronchopneumonia, respiratory tract infection, bronchopneumonia]; the predicted label of the current single-label sample to be tested is then bronchopneumonia.
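The nearest-neighbour vote of Steps 331-332 can be sketched as follows; this is an illustrative sketch with assumed names, where `train` is a list of (feature_vector, label) pairs from the first data set and `dist` is any distance function such as the one of Step 331:

```python
from collections import Counter

def predict_label(test_x, train, k, dist):
    """Take the k training samples nearest to test_x (by `dist`) and
    return the most frequent label among them (Step 332)."""
    neighbors = sorted(train, key=lambda s: dist(test_x, s[0]))[:k]
    return Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
```

With k = 5 and nearest-neighbour labels [bronchopneumonia, bronchitis, bronchopneumonia, respiratory tract infection, bronchopneumonia], the majority vote yields bronchopneumonia, matching the example above.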
Step 34: Determine the label set of the sample to be predicted, as follows:
Step 341: Count how many times each class label appears among the predictions for the C single-label samples into which the sample to be predicted was split in Step 33. On the medical diagnosis data set there are 72 prediction results, so count how often each disease label appears among those 72 results.
Step 342: Compute the posterior probability of each label according to the following formula:

P(lC | T) = (1 / C) · Σ I(h(xiv) = lC),  v = 1, ..., C

where I(·) equals 1 when its argument holds and 0 otherwise.
Here T is the raw data set of Step 11, and h(xiv) is the predicted label of the v-th of the C single-label samples into which the sample to be predicted xi was split. On the medical diagnosis data set, if 30 of the 72 predictions are bronchopneumonia, the posterior probability of the disease bronchopneumonia is 30/72 ≈ 0.4167.
Step 343: Judge whether the posterior probability P(lC | T) of each label is greater than or equal to a threshold t; in practice the value of t is usually chosen according to prediction performance. If P(lC | T) ≥ t, the label is added to the label set of the sample to be predicted; otherwise it is discarded.
Step 344: Output the predicted label set generated in Step 343 as the prediction result for the original multi-label sample. On the medical diagnosis data set, if the 72 predictions of Step 33 also contain 20 of respiratory tract infection, 15 of bronchitis and 7 of asthma, and the threshold is 0.2, the final predicted label set is {bronchopneumonia, respiratory tract infection, bronchitis}.
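Steps 341-344 can be sketched as follows; this is an illustrative sketch with an assumed function name, where each of the C split sub-samples contributes one predicted label and a label enters the output set when its share of the C predictions (its posterior probability) reaches the threshold:

```python
from collections import Counter

def final_label_set(predictions, threshold):
    """Compute P(l_C | T) = count(l_C) / C for each label and keep those
    at or above the threshold (Steps 342-344)."""
    C = len(predictions)
    counts = Counter(predictions)
    return {lbl for lbl, n in counts.items() if n / C >= threshold}

# Reproducing the worked example: 72 predictions with 30 bronchopneumonia,
# 20 respiratory tract infection, 15 bronchitis and 7 asthma votes.
preds = (["bronchopneumonia"] * 30 + ["respiratory tract infection"] * 20
         + ["bronchitis"] * 15 + ["asthma"] * 7)
result = final_label_set(preds, threshold=0.2)
```

Asthma, at 7/72 ≈ 0.097 < 0.2, is excluded, so the result matches the example's final label set.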
The following is a system embodiment corresponding to the method embodiment above; the two may be implemented in cooperation with each other. Technical details mentioned in the method embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, technical details mentioned in this embodiment also apply to the method embodiment.
The invention also provides an improved ML-kNN system suitable for multi-label classification, comprising the following modules:
Feature-label weight computation module: obtains the raw data set, which comprises a plurality of samples, each carrying multiple class labels and multiple features; counts, in the raw data set, the total number of samples per class label as the label sample count; counts, within the samples of each class label, the total number of samples per feature as the feature sample count; and computes the feature-label weights from the label sample count and the feature sample count, each feature corresponding to one feature value;
Sample splitting module: splits every sample in the raw data set into multiple original single-label samples, each with a single label, updates the feature values of every original single-label sample according to the feature-label weights, and generates the first data set;
Test sample prediction module: obtains the test sample to be predicted, splits it into single-label samples to be tested, each with a single label, predicts the label of each single-label sample to be tested in turn according to the first data set, and determines the label set of the test sample.
In this improved ML-kNN system suitable for multi-label classification, the test sample prediction module comprises:
Statistics module: counts the total number of label classes C involved in the raw data set and splits the unknown multi-label sample into C single-label samples to be tested;
Update module: updates the feature values of the C single-label samples to be tested according to the feature-label weights, generating the second data set;
Label prediction module: computes the distance between every single-label sample to be tested in the second data set and every original single-label sample in the first data set, predicts one prediction label for each single-label sample to be tested in turn, and combines all prediction labels into the prediction label set;
Label set module: determines the label set of the test sample according to the number of occurrences of each label class in the prediction label set.
In this improved ML-kNN system suitable for multi-label classification, the label prediction module comprises:
Distance computation module: computes the distance between a single-label sample to be tested and each original single-label sample in the first data set;
Screening module: takes out, according to that distance, the k original single-label samples nearest to the single-label sample to be tested, and selects the label class occurring most frequently among those k samples as the prediction label of the single-label sample to be tested.
In this improved ML-kNN system suitable for multi-label classification, the label set module further: computes the posterior probability of occurrence of each prediction label and, if the posterior probability is greater than or equal to a predetermined threshold, adds the prediction label to the result label set, taking the result label set as the label set of the test sample.
In this improved ML-kNN system suitable for multi-label classification, the value range of the feature value in the feature-label weight computation module is {0, 1}.
Although the present invention is disclosed through the above embodiments, the specific embodiments serve only to explain the invention and are not intended to limit it. Any person skilled in the art may make changes and improvements without departing from the spirit and scope of the invention; the scope of the invention is therefore defined by the claims.

Claims (10)

1. An improved ML-kNN method suitable for multi-label classification, characterized by comprising the following steps:
Step 1: obtaining a raw data set comprising a plurality of samples, wherein each sample has multiple class labels and multiple features; counting, in the raw data set, the total number of samples per class label as a label sample count; counting, within the samples of each class label, the total number of samples per feature as a feature sample count; and computing feature-label weights from the label sample count and the feature sample count, each feature corresponding to one feature value;
Step 2: splitting every sample in the raw data set into multiple original single-label samples, each having a single label, updating the feature values of every original single-label sample according to the feature-label weights, and generating a first data set;
Step 3: obtaining a test sample to be predicted, splitting the test sample into single-label samples to be tested, each having a single label, predicting the label of each single-label sample to be tested in turn according to the first data set, and determining the label set of the test sample.
2. The improved ML-kNN method suitable for multi-label classification of claim 1, characterized in that Step 3 comprises:
Step 31: counting the total number of label classes C involved in the raw data set, and splitting the unknown multi-label sample into C single-label samples to be tested;
Step 32: updating the feature values of the C single-label samples to be tested according to the feature-label weights, and generating a second data set;
Step 33: computing the distance between every single-label sample to be tested in the second data set and every original single-label sample in the first data set, predicting one prediction label for each single-label sample to be tested in turn, and combining all prediction labels into a prediction label set;
Step 34: determining the label set of the test sample according to the number of occurrences of each label class in the prediction label set.
3. The improved ML-kNN method suitable for multi-label classification of claim 2, characterized in that the prediction in Step 33 is specifically:
Step 331: computing the distance between a single-label sample to be tested and each original single-label sample in the first data set;
Step 332: taking out, according to the distance, the k original single-label samples nearest to the single-label sample to be tested, and taking the label class occurring most frequently among the k original single-label samples as the prediction label of the single-label sample to be tested.
4. The improved ML-kNN method suitable for multi-label classification of claim 2, characterized in that Step 34 further comprises: computing the posterior probability of occurrence of each prediction label, and if the posterior probability is greater than or equal to a predetermined threshold, adding the prediction label to a result label set and taking the result label set as the label set of the test sample.
5. The improved ML-kNN method suitable for multi-label classification of claim 1, characterized in that in Step 1 the value range of the feature value is {0, 1}.
6. An improved ML-kNN system suitable for multi-label classification, characterized by comprising the following modules:
a feature-label weight computation module for obtaining a raw data set comprising a plurality of samples, wherein each sample has multiple class labels and multiple features; counting, in the raw data set, the total number of samples per class label as a label sample count; counting, within the samples of each class label, the total number of samples per feature as a feature sample count; and computing feature-label weights from the label sample count and the feature sample count, each feature corresponding to one feature value;
a sample splitting module for splitting every sample in the raw data set into multiple original single-label samples, each having a single label, updating the feature values of every original single-label sample according to the feature-label weights, and generating a first data set;
a test sample prediction module for obtaining a test sample to be predicted, splitting the test sample into single-label samples to be tested, each having a single label, predicting the label of each single-label sample to be tested in turn according to the first data set, and determining the label set of the test sample.
7. The improved ML-kNN system suitable for multi-label classification of claim 6, characterized in that the test sample prediction module comprises:
a statistics module for counting the total number of label classes C involved in the raw data set and splitting the unknown multi-label sample into C single-label samples to be tested;
an update module for updating the feature values of the C single-label samples to be tested according to the feature-label weights and generating a second data set;
a label prediction module for computing the distance between every single-label sample to be tested in the second data set and every original single-label sample in the first data set, predicting one prediction label for each single-label sample to be tested in turn, and combining all prediction labels into a prediction label set;
a label set module for determining the label set of the test sample according to the number of occurrences of each label class in the prediction label set.
8. The improved ML-kNN system suitable for multi-label classification of claim 7, characterized in that the label prediction module comprises:
a distance computation module for computing the distance between a single-label sample to be tested and each original single-label sample in the first data set;
a screening module for taking out, according to the distance, the k original single-label samples nearest to the single-label sample to be tested, and taking the label class occurring most frequently among the k original single-label samples as the prediction label of the single-label sample to be tested.
9. The improved ML-kNN system suitable for multi-label classification of claim 7, characterized in that the label set module further: computes the posterior probability of occurrence of each prediction label, and if the posterior probability is greater than or equal to a predetermined threshold, adds the prediction label to a result label set and takes the result label set as the label set of the test sample.
10. The improved ML-kNN system suitable for multi-label classification of claim 6, characterized in that the value range of the feature value in the feature-label weight computation module is {0, 1}.
CN201710278015.8A 2017-04-25 2017-04-25 A kind of ML kNN improved methods and system classified suitable for multi-tag Pending CN107133293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710278015.8A CN107133293A (en) 2017-04-25 2017-04-25 A kind of ML kNN improved methods and system classified suitable for multi-tag

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710278015.8A CN107133293A (en) 2017-04-25 2017-04-25 A kind of ML kNN improved methods and system classified suitable for multi-tag

Publications (1)

Publication Number Publication Date
CN107133293A true CN107133293A (en) 2017-09-05

Family

ID=59715566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710278015.8A Pending CN107133293A (en) 2017-04-25 2017-04-25 A kind of ML kNN improved methods and system classified suitable for multi-tag

Country Status (1)

Country Link
CN (1) CN107133293A (en)


Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107748783A (en) * 2017-10-24 2018-03-02 天津大学 A kind of multi-tag company based on sentence vector describes file classification method
CN107845424B (en) * 2017-11-15 2021-11-16 海南大学 Method and system for diagnostic information processing analysis
CN107845424A (en) * 2017-11-15 2018-03-27 海南大学 The method and system of diagnostic message Treatment Analysis
CN108667502A (en) * 2018-04-27 2018-10-16 电子科技大学 A kind of spatial modulation antenna selecting method based on machine learning
CN108734304A (en) * 2018-05-31 2018-11-02 阿里巴巴集团控股有限公司 A kind of training method of data model, device and computer equipment
CN108734304B (en) * 2018-05-31 2022-04-19 创新先进技术有限公司 Training method and device of data model and computer equipment
CN109063001A (en) * 2018-07-09 2018-12-21 北京小米移动软件有限公司 page display method and device
CN109754000A (en) * 2018-12-21 2019-05-14 昆明理工大学 A kind of semi-supervised multi-tag classification method based on dependency degree
CN110008990A (en) * 2019-02-22 2019-07-12 上海拉扎斯信息科技有限公司 More classification methods and device, electronic equipment and storage medium
CN111191001A (en) * 2019-12-23 2020-05-22 浙江大胜达包装股份有限公司 Enterprise multi-element label identification method for paper package and related industries thereof
CN112464973A (en) * 2020-08-13 2021-03-09 浙江师范大学 Multi-label classification method based on average distance weight and value calculation
CN112464973B (en) * 2020-08-13 2024-02-02 浙江师范大学 Multi-label classification method based on average distance weight and value calculation
CN112450944A (en) * 2020-12-11 2021-03-09 郑州大学 Label correlation guide feature fusion electrocardiogram multi-classification prediction system and method
CN112450944B (en) * 2020-12-11 2022-10-21 郑州大学 Label correlation guide feature fusion electrocardiogram multi-classification prediction system and method
CN112766383A (en) * 2021-01-22 2021-05-07 浙江工商大学 Label enhancement method based on feature clustering and label similarity


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170905