CN107526805A - A weight-based ML-kNN multi-label Chinese text classification method - Google Patents

A weight-based ML-kNN multi-label Chinese text classification method

Info

Publication number
CN107526805A
CN107526805A
Authority
CN
China
Prior art keywords
tag
label
text
classification
labels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710724115.9A
Other languages
Chinese (zh)
Other versions
CN107526805B (en)
Inventor
姜明
张旻
杜炼
汤景凡
程柳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201710724115.9A priority Critical patent/CN107526805B/en
Publication of CN107526805A publication Critical patent/CN107526805A/en
Application granted granted Critical
Publication of CN107526805B publication Critical patent/CN107526805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes

Abstract

The invention discloses a weight-based ML-kNN multi-label Chinese text classification method. It mainly addresses the problem that, when the ML-kNN algorithm is used to classify multi-label Chinese text, an imbalanced number of labels per class in the training set, or an uneven spatial distribution of the training samples in space, easily causes the label set of an unseen example to be misjudged or judged incompletely. The adopted technical solution assigns, within a local neighbourhood, a corrective weight to the labels of all neighbours according to the proportion of each class's label count in the training set; then, in the decision stage for the unseen example's label set, a further weight is assigned to each label according to the mutual information between the unseen example and the spatial distribution of the training examples. In addition, to improve classification efficiency, the invention pre-classifies the texts before formal classification, which effectively raises the efficiency of multi-label Chinese text classification.

Description

A weight-based ML-kNN multi-label Chinese text classification method
Technical field
The present invention relates to the field of text classification, and in particular to a weight-based ML-kNN multi-label Chinese text classification method.
Background technology
The multi-label problem is a common phenomenon in the real world. For example, in text classification, a news release may cover several predefined topics such as "education" and "sports" at the same time; in image classification, a picture may contain multiple scenes such as "field" and "mountains"; in bioinformatics, a gene may simultaneously have multiple functions such as "metabolism", "transcription" and "protein synthesis"; in audio classification, a song may belong to several categories such as "cheerful" and "happy"; in video classification, a film may belong to several genres such as "drama" and "romance". This has motivated research on multi-label classification, whose goal is to learn a multi-label classifier from the given training examples and their corresponding sets of class labels. With such a classifier, for any example to be classified, the set of labels corresponding to that example can be predicted.
Classification can be viewed as a learning problem whose task is to build a learner that assigns each example to be classified to its corresponding category. However, because an example may be associated with multiple categories at the same time, this is neither a binary nor an ordinary multi-class problem: it is a multi-label classification problem.
The traditional multi-label algorithm ML-kNN uses the "k-nearest neighbours (k-NN)" classification criterion: it collects the class label information of the neighbouring samples and infers the label set of an unseen example by "maximum a posteriori (MAP)" reasoning. When the numbers of the various class labels in the training examples are imbalanced, or the training examples of the various class labels are unevenly distributed in space, this is likely to cause the label set of an unseen example to be misjudged or judged incompletely.
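For orientation, the standard (unweighted) ML-kNN prior estimate that the MAP reasoning above starts from can be sketched as follows; this is a generic illustration of the textbook algorithm, not the patent's weighted variant, and the smoothing constant `s` and variable names are our own:

```python
import numpy as np

def mlknn_priors(Y, s=1.0):
    """Estimate the prior P(H_j) for each label j with Laplace smoothing s.
    Y is an (m, q) 0/1 matrix: Y[i, j] = 1 iff training example i carries label j."""
    m, _ = Y.shape
    p1 = (s + Y.sum(axis=0)) / (2 * s + m)  # smoothed fraction of samples carrying each label
    return p1, 1.0 - p1

# Four training examples, two labels.
Y = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
p1, p0 = mlknn_priors(Y)
print(np.round(p1, 3))
```

With label 0 present in 3 of 4 samples and label 1 in 2 of 4, the smoothed priors come out as (1+3)/(2+4) = 0.667 and (1+2)/(2+4) = 0.5.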
Summary of the invention
The purpose of the present invention is to address the above shortcomings of the prior art by providing a weight-based ML-kNN multi-label Chinese text classification method, which solves the problem that the traditional ML-kNN algorithm, when classifying multi-label Chinese text, accounts for neither the label counts nor the spatial distribution of the training examples.
A weight-based ML-kNN multi-label Chinese text classification method comprises the following steps:
Step 1: pre-classify the text to be classified;
Step 2: compute the weight factor addressing the label-count imbalance problem;
Step 3: compute the corrective weight factor proposed for the spatial distribution of the training examples;
Step 4: based on the weight factor, compute the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj denotes the event that the unseen example x has class label yj;
Step 5: compute the conditional probabilities that, given Hj holds (respectively does not hold), exactly Cj samples in N(x) have class label yj, denoted P(Cj | Hj) and P(Cj | ¬Hj) respectively, where N(x) denotes the set formed by the k nearest-neighbour samples of x in the training set, and Cj counts the samples in N(x) that have yj as one of their labels;
Step 6: combining the results computed in steps 4 and 5 and based on the ML-kNN algorithm, obtain the required multi-label classifier;
Step 7: classify the unclassified text by combining the pre-classification results from step 1 with the classifier obtained in step 6.
The pre-classification of the text to be classified in step 1 proceeds as follows:
1.1 First determine all category names, and use them as the original category label set;
1.2 Use all the text data in the training set together with the latest Chinese Wikipedia corpus as the model's corpus. The Chinese Wikipedia corpus must first undergo traditional-to-simplified conversion, then word segmentation, and then removal of stop words and low-frequency words, retaining nouns, noun phrases, adjectives, verbs and other potentially meaningful words;
1.3 Expand the category label set: use the word-embedding model word2vec to represent all the words in the corpus from step 1.2 as vectors; the distance between words in the vector space can represent their semantic similarity. With word2vec, add to the category label set every word in the corpus whose similarity to a category in the original category label set exceeds 0.9, thereby expanding the category label set so that the expanded category labels have greater power to characterize their categories;
1.4 Owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore, using the expanded category label set, traverse all texts to be classified and tag each text with the corresponding class labels.
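Steps 1.1-1.4 above can be sketched as follows. The word2vec-based expansion (similarity > 0.9) is stubbed with a hand-made synonym table, since in the patent the expansion terms would come from a word2vec model trained on the training texts plus Chinese Wikipedia; all terms and names below are hypothetical illustrations:

```python
# Hypothetical expanded category label set (step 1.3 stub: in the patent these
# sets would be grown by word2vec similarity > 0.9, not written by hand).
EXPANDED_LABELS = {
    "education": {"education", "school", "teacher"},
    "sports": {"sports", "football", "athlete"},
}

def presort(tokens):
    """Step 1.4: return the labels whose expanded terms occur in the tokenized text."""
    token_set = set(tokens)
    return {label for label, terms in EXPANDED_LABELS.items() if token_set & terms}

labels = presort(["the", "teacher", "praised", "the", "athlete"])
print(sorted(labels))  # ['education', 'sports']
```

A text that mentions no expanded term receives no pre-assigned label and is left entirely to the classifier of step 6.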
The computation in step 2 of the weight factor addressing the label-count imbalance problem proceeds as follows:
2.1 Compute the number of samples containing label l, i.e. nl = Σ(i=1..m) 1{l ∈ Yi}, where m is the number of training samples and the indicator 1{l ∈ Yi} equals 1 if example i carries label l and 0 otherwise;
2.2 Compute the mean of the per-label counts over the training-sample label sets, μ = (1/|γ|) Σl nl, where γ = {y1, y2, …, yq} denotes the label space containing q categories, so |γ|, the number of categories in the label space, equals q;
2.3 To counter the classification errors caused by label-count imbalance, define the weight factor wnum of label l in terms of nl and μ (the formula appears only as an image in the original).
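Steps 2.1-2.2 can be computed directly; the exact weight formula of step 2.3 is an image in the source, so the sketch below substitutes an illustrative assumption, w_l = μ / n_l, which matches the stated intent that under-represented labels receive larger corrective weights:

```python
import numpy as np

def imbalance_weights(Y):
    """Steps 2.1-2.2: per-label counts n_l and their mean mu over the q labels.
    Step 2.3's formula is not given in the source; w_l = mu / n_l is our
    hypothetical stand-in (rarer labels get larger weights)."""
    n = Y.sum(axis=0)              # n_l: number of training samples carrying label l
    mu = n.mean()                  # mean label count over the label space
    return mu / np.maximum(n, 1)   # guard against n_l = 0

Y = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 0, 0]])
w = imbalance_weights(Y)
print(w)  # counts [4, 1, 1], mean 2 -> weights [0.5, 2, 2]
```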
The computation in step 3 of the corrective weight factor proposed for the spatial distribution of the training examples proceeds as follows:
3.1 Since the standard deviation can reflect the overall spatial distribution of examples, the present invention takes the standard deviation of the distances between examples carrying the same label within the local space as the local label density of that label, denoted ρ;
3.2 From the spatial distribution of the k nearest-neighbour examples around the position of the unseen example, compute the local label density ρl of the examples whose label sets contain label l among those k neighbours;
3.3 Using the mutual information between the unseen example and the spatial distribution of its k nearest-neighbour examples in the training set, rank, from low to high, the strength of the unseen example's influence on the local label density of each label contained in the label sets of the k nearest-neighbour examples. Concretely, for label l: when label l is present in the unseen example's label set, the new local label density of label l is ρl′, and the influence strength of the unseen example's label l on the local label density is computed from ρl and ρl′ (the formula appears only as an image in the original);
3.4 Compute the local-label-density influence weight proposed for the spatial distribution of the training examples, where σ is the corrective coefficient of the weight (the formula appears only as an image in the original).
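Steps 3.1-3.2 can be sketched as follows. The source's formulas are images, so this is our reading of the text: the local density ρl of label l is the standard deviation of the pairwise distances among the neighbours that carry label l:

```python
import numpy as np

def local_label_density(neighbors, has_label):
    """Steps 3.1-3.2 (our reading of the text): rho_l is the standard deviation
    of pairwise distances among the k nearest neighbours that carry label l."""
    pts = neighbors[has_label]
    if len(pts) < 2:
        return 0.0  # density is taken as 0 with fewer than two carriers
    dists = [np.linalg.norm(a - b) for i, a in enumerate(pts) for b in pts[i + 1:]]
    return float(np.std(dists))

# Three 1-D neighbours carrying the label, at positions 0, 1 and 3; one without.
neighbors = np.array([[0.0], [1.0], [3.0], [10.0]])
has_label = np.array([True, True, True, False])
print(local_label_density(neighbors, has_label))  # std of distances 1, 3, 2 ≈ 0.816
```

The influence strength of step 3.3 would then compare ρl against the recomputed ρl′ after hypothetically adding the unseen example to the carriers.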
Advantages and beneficial effects of the present invention:
For the multi-label classification problem, by taking into account both the count distribution of the various class labels and the spatial distribution of the training examples, the present invention eliminates the defects of the ML-kNN algorithm in multi-label Chinese text classification and improves the effectiveness of multi-label classification. Furthermore, pre-classifying the texts before formal classification by building and expanding the category label set can greatly improve the efficiency of multi-label Chinese text classification.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of the invention.
Embodiment
The invention is further described below with reference to the accompanying drawings.
Referring to Fig. 1, a weight-based ML-kNN multi-label Chinese text classification method comprises the following steps:
1) Pre-classify the text to be classified, as follows:
1.1) First determine all category names and use them as the original category label set;
1.2) Use all the text data in the training set together with the latest Chinese Wikipedia corpus as the model's corpus. The Chinese Wikipedia corpus must first undergo traditional-to-simplified conversion, then word segmentation, and then removal of stop words (adverbs, prepositions, etc.) and low-frequency words, retaining nouns, noun phrases, adjectives, verbs and other potentially meaningful words;
1.3) Expand the category label set: use the word-embedding model word2vec to represent all words in the corpus from step 1.2) as vectors; the distance between words in the vector space can represent their semantic similarity. The present invention therefore uses word2vec to add to the category label set every word in the corpus whose similarity to a category in the original category label set exceeds 0.9, expanding the set so that the expanded category labels have greater power to characterize their categories;
1.4) Owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category. Using the expanded category label set, traverse all texts to be classified and tag each text with the corresponding class labels.
2) Compute the weight factor addressing the label-count imbalance problem, as follows:
2.1) Compute the number of samples containing label l, i.e. nl = Σ(i=1..m) 1{l ∈ Yi}, where m is the number of training samples and the indicator equals 1 if example i carries label l (that is, label l is in example i's label set Yi) and 0 otherwise;
2.2) Compute the mean of the per-label counts over the training-sample label sets, μ = (1/|γ|) Σl nl, where γ = {y1, y2, …, yq} denotes the label space containing q categories, so |γ| = q;
2.3) To counter the classification errors caused by label-count imbalance, define the weight factor wnum of label l in terms of nl and μ (the formula appears only as an image in the original). Because the weight factors of the various labels may differ considerably, applying them globally would strongly distort the classification result; the method therefore weights only the k nearest neighbours local to the unseen example.
3) Compute the corrective weight factor proposed for the spatial distribution of the training examples, as follows:
3.1) Since the standard deviation can reflect the overall spatial distribution of examples, the present invention takes the standard deviation of the distances between examples carrying the same label within the local space as the local label density of that label, denoted ρ;
3.2) From the spatial distribution of the k nearest-neighbour examples around the position of the unseen example, compute the local label density ρl of the examples whose label sets contain label l among those k neighbours;
3.3) Using the mutual information between the unseen example and the spatial distribution of its k nearest-neighbour examples in the training set, rank, from low to high, the strength of the unseen example's influence on the local label density of each label contained in the label sets of the k nearest-neighbour examples. Concretely, for label l: when label l is present in the unseen example's label set, the new local label density of label l is ρl′, and the influence strength of the unseen example's label l on the local label density is computed from ρl and ρl′ (the formula appears only as an image in the original);
3.4) Compute the local-label-density influence weight proposed for the spatial distribution of the training examples, where σ is the corrective coefficient of the weight (the formula appears only as an image in the original).
4) Taking into account the weight factor wnum computed in step 2), compute the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj denotes the event that the unseen example x has class label yj;
5) Compute the conditional probabilities that, given Hj holds (respectively does not hold), exactly Cj samples in N(x) have class label yj, denoted P(Cj | Hj) and P(Cj | ¬Hj) respectively, where N(x) denotes the set formed by the k nearest-neighbour samples of x in the training set, and Cj counts the samples in N(x) that have yj as one of their labels;
6) Combining the results computed in steps 4) and 5) and based on the ML-kNN algorithm, compute the posterior probabilities by Bayes' theorem to obtain the required multi-label classifier;
7) Before classifying a text with the classifier obtained in step 6), first directly accept the labels already determined in the pre-classification result of step 1), and then use the classifier to judge only the labels not yet determined.
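The MAP decision of steps 4)-6) can be sketched per label as follows. The priors and conditionals here are illustrative numbers, not estimates produced by the patent's weighted counting:

```python
def map_decide(prior_h, cond_h, cond_not_h, c_j):
    """Steps 4)-6): predict label y_j present iff
    P(H_j) * P(C_j | H_j) > P(~H_j) * P(C_j | ~H_j),
    where c_j is the number of the k neighbours carrying y_j."""
    post_yes = prior_h * cond_h[c_j]
    post_no = (1.0 - prior_h) * cond_not_h[c_j]
    return int(post_yes > post_no)

# With both of 2 neighbours carrying the label (c_j = 2), the posterior for
# "present" (0.6 * 0.70) beats the posterior for "absent" (0.4 * 0.05).
print(map_decide(0.6, cond_h=[0.05, 0.25, 0.70], cond_not_h=[0.70, 0.25, 0.05], c_j=2))  # 1
```

Step 7) then only applies this decision to labels the pre-classification of step 1) has not already fixed.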

Claims (4)

1. A weight-based ML-kNN multi-label Chinese text classification method, characterised by comprising the following steps:
Step 1: pre-classify the text to be classified;
Step 2: compute the weight factor addressing the label-count imbalance problem;
Step 3: compute the corrective weight factor proposed for the spatial distribution of the training examples;
Step 4: based on the weight factor, compute the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj denotes the event that the unseen example x has class label yj;
Step 5: compute the conditional probabilities that, given Hj holds (respectively does not hold), exactly Cj samples in N(x) have class label yj, denoted P(Cj | Hj) and P(Cj | ¬Hj) respectively, where N(x) denotes the set formed by the k nearest-neighbour samples of x in the training set, and Cj counts the samples in N(x) that have yj as one of their labels;
Step 6: combining the results computed in steps 4 and 5 and based on the ML-kNN algorithm, obtain the required multi-label classifier;
Step 7: classify the unclassified text by combining the pre-classification results from step 1 with the classifier obtained in step 6.
2. The weight-based ML-kNN multi-label Chinese text classification method according to claim 1, characterised in that the pre-classification of the text to be classified in step 1 proceeds as follows:
1.1 First determine all category names and use them as the original category label set;
1.2 Use all the text data in the training set together with the latest Chinese Wikipedia corpus as the model's corpus, where the Chinese Wikipedia corpus must first undergo traditional-to-simplified conversion, then word segmentation, and then removal of stop words and low-frequency words, retaining nouns, noun phrases, adjectives, verbs and other potentially meaningful words;
1.3 Expand the category label set: use the word-embedding model word2vec to represent all the words in the corpus from step 1.2 as vectors; the distance between words in the vector space can represent their semantic similarity; with word2vec, add to the category label set every word in the corpus whose similarity to a category in the original category label set exceeds 0.9, thereby expanding the category label set so that the expanded category labels have greater power to characterize their categories;
1.4 Owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore, using the expanded category label set, traverse all texts to be classified and tag each text with the corresponding class labels.
3. The weight-based ML-kNN multi-label Chinese text classification method according to claim 2, characterised in that the computation in step 2 of the weight factor addressing the label-count imbalance problem proceeds as follows:
2.1 Compute the number of samples containing label l, i.e. nl = Σ(i=1..m) 1{l ∈ Yi}, where m is the number of training samples and the indicator equals 1 if example i carries label l and 0 otherwise;
2.2 Compute the mean of the per-label counts over the training-sample label sets, μ = (1/|γ|) Σl nl, where γ = {y1, y2, …, yq} denotes the label space containing q categories, so |γ| = q;
2.3 To counter the classification errors caused by label-count imbalance, define the weight factor wnum of label l in terms of nl and μ (the formula appears only as an image in the original).
4. The weight-based ML-kNN multi-label Chinese text classification method according to claim 3, characterised in that the computation in step 3 of the corrective weight factor proposed for the spatial distribution of the training examples proceeds as follows:
3.1 Since the standard deviation can reflect the overall spatial distribution of examples, take the standard deviation of the distances between examples carrying the same label within the local space as the local label density of that label, denoted ρ;
3.2 From the spatial distribution of the k nearest-neighbour examples around the position of the unseen example, compute the local label density ρl of the examples whose label sets contain label l among those k neighbours;
3.3 Using the mutual information between the unseen example and the spatial distribution of its k nearest-neighbour examples in the training set, rank, from low to high, the strength of the unseen example's influence on the local label density of each label contained in the label sets of the k nearest-neighbour examples, where the influence strength for label l is computed as follows: when label l is present in the unseen example's label set, the new local label density of label l is ρl′, and the influence strength of the unseen example's label l on the local label density is computed from ρl and ρl′ (the formula appears only as an image in the original);
3.4 Compute the local-label-density influence weight proposed for the spatial distribution of the training examples, where σ is the corrective coefficient of the weight (the formula appears only as an image in the original).
CN201710724115.9A 2017-08-22 2017-08-22 ML-kNN multi-label Chinese text classification method based on weight Active CN107526805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710724115.9A CN107526805B (en) 2017-08-22 2017-08-22 ML-kNN multi-label Chinese text classification method based on weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710724115.9A CN107526805B (en) 2017-08-22 2017-08-22 ML-kNN multi-label Chinese text classification method based on weight

Publications (2)

Publication Number Publication Date
CN107526805A true CN107526805A (en) 2017-12-29
CN107526805B CN107526805B (en) 2019-12-24

Family

ID=60681840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710724115.9A Active CN107526805B (en) 2017-08-22 2017-08-22 ML-kNN multi-label Chinese text classification method based on weight

Country Status (1)

Country Link
CN (1) CN107526805B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933579A (en) * 2019-02-01 2019-06-25 中山大学 A kind of part k nearest neighbor missing values interpolation system and method
CN110059756A (en) * 2019-04-23 2019-07-26 东华大学 A kind of multi-tag categorizing system based on multiple-objection optimization
CN111930892A (en) * 2020-08-07 2020-11-13 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
CN112241454A (en) * 2020-12-14 2021-01-19 成都数联铭品科技有限公司 Text classification method for processing sample inclination
CN112464973A (en) * 2020-08-13 2021-03-09 浙江师范大学 Multi-label classification method based on average distance weight and value calculation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050187892A1 (en) * 2004-02-09 2005-08-25 Xerox Corporation Method for multi-class, multi-label categorization using probabilistic hierarchical modeling
CN105095494A (en) * 2015-08-21 2015-11-25 中国地质大学(武汉) Method for testing categorical data set
CN106886569A (en) * 2017-01-13 2017-06-23 重庆邮电大学 A kind of ML KNN multi-tag Chinese Text Categorizations based on MPI

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050187892A1 (en) * 2004-02-09 2005-08-25 Xerox Corporation Method for multi-class, multi-label categorization using probabilistic hierarchical modeling
CN105095494A (en) * 2015-08-21 2015-11-25 中国地质大学(武汉) Method for testing categorical data set
CN106886569A (en) * 2017-01-13 2017-06-23 重庆邮电大学 A kind of ML KNN multi-tag Chinese Text Categorizations based on MPI

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Sinan et al., "Multi-label Data Mining Techniques: A Survey", Computer Science (《计算机科学》) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933579A (en) * 2019-02-01 2019-06-25 中山大学 A kind of part k nearest neighbor missing values interpolation system and method
CN110059756A (en) * 2019-04-23 2019-07-26 东华大学 A kind of multi-tag categorizing system based on multiple-objection optimization
CN111930892A (en) * 2020-08-07 2020-11-13 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
CN111930892B (en) * 2020-08-07 2023-09-29 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
CN112464973A (en) * 2020-08-13 2021-03-09 浙江师范大学 Multi-label classification method based on average distance weight and value calculation
CN112464973B (en) * 2020-08-13 2024-02-02 浙江师范大学 Multi-label classification method based on average distance weight and value calculation
CN112241454A (en) * 2020-12-14 2021-01-19 成都数联铭品科技有限公司 Text classification method for processing sample inclination

Also Published As

Publication number Publication date
CN107526805B (en) 2019-12-24

Similar Documents

Publication Publication Date Title
CN107526805A (en) A kind of ML kNN multi-tag Chinese Text Categorizations based on weight
CN107169049B (en) Application tag information generation method and device
CN107609121A (en) Newsletter archive sorting technique based on LDA and word2vec algorithms
Barbieri et al. Multimodal emoji prediction
US20190205393A1 (en) A cross-media search method
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN103810274B (en) Multi-characteristic image tag sorting method based on WordNet semantic similarities
Bin et al. Adaptively attending to visual attributes and linguistic knowledge for captioning
CN108090216B (en) Label prediction method, device and storage medium
CN110674312B (en) Method, device and medium for constructing knowledge graph and electronic equipment
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN113553429B (en) Normalized label system construction and text automatic labeling method
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN105138991A (en) Video emotion identification method based on emotion significant feature integration
CN104537028B (en) A kind of Web information processing method and device
CN106997341A (en) A kind of innovation scheme matching process, device, server and system
CN110569920A (en) prediction method for multi-task machine learning
CN109446393B (en) Network community topic classification method and device
CN108090099B (en) Text processing method and device
CN111858896A (en) Knowledge base question-answering method based on deep learning
CN106227836B (en) Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters
CN112182249A (en) Automatic classification method and device for aviation safety report
Jolly et al. How do convolutional neural networks learn design?
CN111339338B (en) Text picture matching recommendation method based on deep learning
CN109271513A (en) A kind of file classification method, computer-readable storage media and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20171229

Assignee: Hangzhou Yuanchuan New Technology Co.,Ltd.

Assignor: HANGZHOU DIANZI University

Contract record no.: X2020330000104

Denomination of invention: A weight based ml KNN multi label Chinese text classification method

Granted publication date: 20191224

License type: Common License

Record date: 20201125