CN107526805B - ML-kNN multi-tag Chinese text classification method based on weight - Google Patents
- Publication number
- CN107526805B (application CN201710724115.9A)
- Authority
- CN
- China
- Prior art keywords
- label
- category
- labels
- training
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
Abstract
The invention discloses a weight-based ML-kNN multi-label Chinese text classification method. The invention mainly aims to solve the problem that, when the ML-kNN algorithm is used to classify multi-label Chinese texts, misjudgment or incomplete judgment of an unseen example's label set is easily caused by an unbalanced number of each class of labels in the training set or by an uneven spatial distribution of the training samples. The technical scheme adopted is to assign corresponding correction weights to all neighbouring labels within a local range according to the proportions of each class of labels in the training set, and then, at the stage of deciding the unseen example's label set, to assign different weights to the labels according to mutual information between the spatial distributions of the unseen example and the training examples. Meanwhile, to improve classification efficiency, the text undergoes a pre-classification step before formal classification, which effectively improves the efficiency of multi-label Chinese text classification.
Description
Technical Field
The invention relates to the field of text classification, in particular to an ML-kNN multi-label Chinese text classification method based on weight.
Background
The multi-label problem is a common phenomenon in the real world. In text classification, a news article may cover several predefined topics, such as "education" and "sports", at the same time; in image classification, a picture may contain multiple scenes such as "field" and "mountain range"; in bioinformatics, a gene may have multiple functions such as "metabolism", "transcription", and "protein synthesis" simultaneously; in audio classification, a piece of music may belong to several emotion categories at once; in video classification, a movie may belong to multiple genres such as "drama" and "romance". This has motivated research on multi-label classification, which aims to learn a multi-label classifier from given training examples and their corresponding class label sets. On this basis, for any example to be classified, the classifier can predict the set of labels to which that example corresponds.
Multi-label classification can be regarded as a learning problem whose task is to build a learner that assigns a given example to its corresponding categories. However, since an example to be classified may be associated with multiple classes simultaneously, this is neither a binary nor a single-label multi-class classification problem; it is a multi-label classification problem.
The conventional multi-label classification algorithm ML-kNN adopts the k-nearest-neighbour classification criterion: it counts the class label information of the neighbouring samples and infers the label set of an unseen example by maximum a posteriori (MAP) estimation. When the numbers of class labels carried by the training examples are unbalanced, or the training examples carrying a given class label are unevenly distributed in space, this can cause misjudgment or incomplete judgment of the unseen example's label set.
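As a point of reference for the weighted variant the invention proposes, the vanilla ML-kNN decision described above can be sketched as follows. This is an illustrative re-implementation, not the patented method: the function name, the Euclidean metric, and the Laplace smoothing constant s follow the original ML-kNN algorithm, not this document.

```python
import numpy as np

def ml_knn_predict(X_train, Y_train, x, k=10, s=1.0):
    """Vanilla ML-kNN MAP decision for one unseen example x.

    X_train: (m, d) feature matrix; Y_train: (m, q) binary label matrix.
    s is the Laplace smoothing constant of the original algorithm.
    """
    m, q = Y_train.shape
    # k nearest neighbours of x by Euclidean distance
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]

    y_pred = np.zeros(q, dtype=int)
    for j in range(q):
        # prior P(H_j): smoothed fraction of training examples with label y_j
        prior_h1 = (s + Y_train[:, j].sum()) / (2 * s + m)
        prior_h0 = 1.0 - prior_h1
        # C_j: neighbours of x whose label set contains y_j
        c_j = int(Y_train[nn, j].sum())
        # estimate P(C_j | H_j) and P(C_j | not H_j) from how many positive
        # neighbours examples with / without label y_j see among their own k
        kappa = np.zeros(k + 1)
        kappa_bar = np.zeros(k + 1)
        for i in range(m):
            # [1:k+1] skips index 0, which is the example itself (distance 0)
            nn_i = np.argsort(np.linalg.norm(X_train - X_train[i], axis=1))[1:k + 1]
            delta = int(Y_train[nn_i, j].sum())
            if Y_train[i, j] == 1:
                kappa[delta] += 1
            else:
                kappa_bar[delta] += 1
        like_h1 = (s + kappa[c_j]) / (s * (k + 1) + kappa.sum())
        like_h0 = (s + kappa_bar[c_j]) / (s * (k + 1) + kappa_bar.sum())
        # MAP decision: keep label y_j iff the posterior evidence for H_j wins
        y_pred[j] = int(prior_h1 * like_h1 > prior_h0 * like_h0)
    return y_pred
```

For each candidate label, the sketch estimates the prior from label frequencies and the likelihood from neighbourhood statistics, then compares the two posteriors; it is exactly this unweighted comparison that the invention corrects with its two weight factors.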
Disclosure of Invention
The invention aims to provide a weight-based ML-kNN multi-label Chinese text classification method that addresses the defect of the prior art, namely that the conventional ML-kNN algorithm ignores the label counts and the spatial distribution of the training examples when classifying multi-label Chinese texts.
A method for classifying ML-kNN multi-label Chinese texts based on weights comprises the following steps:
step 1, pre-classifying the text to be classified;
step 2, calculating the weight factor that corrects for the unbalanced number of labels;
step 3, calculating the correction weight factor proposed for the spatial distribution of the training examples;
step 4, using the weight factors, calculating the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj is the event that the unseen example x has class label yj;
step 5, calculating the likelihoods P(Cj | Hj) and P(Cj | ¬Hj) that, given Hj holds or not, exactly Cj samples in N(x) carry class label yj, where N(x) is the set of k nearest neighbour samples of x in the training set and Cj counts the samples in N(x) whose label sets contain yj;
step 6, combining the results calculated in steps 4 and 5 to obtain the required multi-label classifier based on the ML-kNN algorithm;
and step 7, classifying the unclassified texts by combining the pre-classification result of step 1 with the classifier of step 6.
The pre-classification of the text to be classified in step 1 proceeds as follows:
1.1, first determine all category names and use them as the original category label set;
1.2, use all text data in the training set together with the latest Chinese Wikipedia corpus as the model corpus, wherein the Chinese Wikipedia corpus is first converted from traditional to simplified Chinese, then segmented into words, after which stop words and low-frequency words are removed, keeping nouns, noun phrases, adjectives, verbs, and other words likely to carry practical meaning;
1.3, expand the category label set: use the word-vector model word2vec to represent every word in the corpus of step 1.2 as a vector, so that distance in the vector space reflects semantic similarity between words; any corpus word whose similarity to a category in the original category label set exceeds 0.9 is added to the category label set, expanding it so that the enlarged set has stronger category representation capability;
1.4, owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore the expanded category label set is used to retrieve over all texts to be classified, traversing them and tagging each text with the corresponding category labels.
The calculation in step 2 of the weight factor that corrects for the unbalanced number of labels proceeds as follows:
2.1, count the number of samples whose label sets contain label l, i.e. s_l = Σ_{i=1}^{m} [l ∈ Y_i], where m is the number of training samples, Y_i is the label set of example i, and the indicator [l ∈ Y_i] takes the value 1 when label l is present in Y_i and 0 otherwise;
2.2, calculate the average number of samples per class over the training label set, s̄ = (1/|γ|) Σ_{l=1}^{q} s_l, where γ = {y1, y2, …, yq} is the label space containing q categories and |γ| denotes the number of categories in it, namely q;
2.3, against the classification errors caused by the unbalanced number of labels, define the weight factor of label l as w_l^num = s̄ / s_l, so that rarer labels receive a larger corrective weight.
The calculation in step 3 of the correction weight factor proposed for the spatial distribution of the training examples proceeds as follows:
3.1, since the standard deviation reflects the overall spatial distribution of examples, the standard deviation of the distances between examples carrying the same label within the local space is used as the local label density of that label, denoted ρ;
3.2, from the spatial distribution of the k nearest neighbour examples around the unseen example, calculate the local label density ρ_l of the examples among those k whose label sets contain label l;
3.3, using the mutual information between the spatial distributions of the unseen example and its k nearest neighbours in the training set, obtain an ordering, from low to high, of the influence strength that the unseen example exerts on the local label density of each label in the k nearest neighbours' label sets; the local label density influence strength of label l is calculated as follows: if label l were present in the unseen example's label set, the new local label density of label l would be ρ_l′, and the influence strength of label l on the local label density is computed from the change between ρ_l and ρ_l′;
3.4, calculate the proposed correction weight w_l^den for the spatial distribution of the training examples from these influence strengths, where σ is a correction coefficient for the weight.
The invention has the following advantages and beneficial effects:
For the multi-label classification problem, the invention overcomes the defects of the ML-kNN algorithm in multi-label Chinese text classification by taking into account both the number of labels of each class and the spatial distribution of the training examples, thereby improving the multi-label classification effect. Meanwhile, before formal classification, the texts are pre-classified using a constructed and expanded category label set, which greatly improves the efficiency of multi-label Chinese text classification.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, a method for classifying ML-kNN multi-tag chinese texts based on weights includes the following steps:
1) The text to be classified is pre-classified as follows:
1.1) first determine all category names and use them as the original category label set;
1.2) use all text data in the training set together with the latest Chinese Wikipedia corpus as the model corpus, wherein the Chinese Wikipedia corpus is first converted from traditional to simplified Chinese, then segmented into words, after which stop words (adverbs, prepositions, and the like) and low-frequency words are removed, keeping nouns, noun phrases, adjectives, verbs, and other words likely to carry practical meaning;
1.3) expand the category label set: use the word-vector model word2vec to represent every word in the corpus of step 1.2) as a vector, so that distance in the vector space reflects semantic similarity between words; the invention therefore uses word2vec to add to the category label set any corpus word whose similarity to a category in the original category label set exceeds 0.9, expanding the set so that the enlarged category labels have stronger category representation capability;
1.4) owing to the characteristics of Chinese text, when a category name appears in a text, the text must be related to that category. The expanded category label set is retrieved, all texts to be classified are traversed, and each text is tagged with its corresponding category labels.
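Steps 1.3) and 1.4) can be sketched as follows. This is a hypothetical implementation: the word vectors are assumed to be available as a plain dict (for instance exported from a word2vec model trained on the segmented corpus), the function names are illustrative, and the 0.9 similarity threshold is the one stated in step 1.3).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_label_set(original_labels, word_vectors, threshold=0.9):
    """Step 1.3): any corpus word whose cosine similarity to a category
    name exceeds the threshold joins that category's name set.

    word_vectors: dict mapping word -> np.ndarray embedding.
    Returns a dict mapping each category to its expanded name set.
    """
    expanded = {}
    for label in original_labels:
        names = {label}
        if label in word_vectors:
            lv = word_vectors[label]
            for word, wv in word_vectors.items():
                if word != label and cosine(lv, wv) > threshold:
                    names.add(word)
        expanded[label] = names
    return expanded

def pre_classify(tokens, expanded):
    """Step 1.4): a segmented text containing any (expanded) category
    name is immediately tagged with that category."""
    toks = set(tokens)
    return {label for label, names in expanded.items() if names & toks}
```

The pre-classification result is then carried into step 7), where the labels it fixes are skipped during formal classification.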
2) Calculate the weight factor that corrects for the unbalanced number of labels, as follows:
2.1) count the number of samples whose label sets contain label l, i.e. s_l = Σ_{i=1}^{m} [l ∈ Y_i], where m is the number of training samples, Y_i is the label set of example i, and the indicator [l ∈ Y_i] takes the value 1 when the label set of example i contains label l and 0 otherwise;
2.2) calculate the average number of samples per class over the training label set, s̄ = (1/|γ|) Σ_{l=1}^{q} s_l, where γ = {y1, y2, …, yq} is the label space containing q categories and |γ| denotes the number of categories in it, namely q;
2.3) against the classification errors caused by the unbalanced number of labels, define the weight factor of label l as w_l^num = s̄ / s_l. Since the weight factors of different labels may differ considerably, applying them globally would influence the classification effect too strongly, so the weighting is applied only to the local k nearest neighbours of the unseen example.
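A minimal sketch of steps 2.1)-2.3). The inverse-frequency form of the weight, s̄ / s_l, is an assumption here: the original formula image was not preserved, and the text only states that the factor corrects for unbalanced label counts, which an inverse-frequency weight does by boosting rarer labels.

```python
import numpy as np

def label_count_weights(Y):
    """Steps 2.1)-2.3), hedged: Y is the (m, q) binary label matrix.

    s_l counts the training samples whose label set contains label l;
    s_bar is the average count per class; the (assumed) weight s_bar/s_l
    gives rarer labels a larger corrective weight, to be applied only
    within the local k-nearest-neighbour range as step 2.3) describes.
    """
    s = Y.sum(axis=0).astype(float)      # s_l for l = 1..q  (step 2.1)
    s_bar = s.mean()                     # average per-class count (step 2.2)
    return s_bar / np.maximum(s, 1.0)    # w_l^num, guarded against s_l = 0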
3) Calculate the correction weight factor proposed for the spatial distribution of the training examples, as follows:
3.1) since the standard deviation reflects the overall spatial distribution of examples, the standard deviation of the distances between examples carrying the same label within the local space is used as the local label density of that label, denoted ρ;
3.2) from the spatial distribution of the k nearest neighbour examples around the unseen example, calculate the local label density ρ_l of the examples among those k whose label sets contain label l;
3.3) using the mutual information between the spatial distributions of the unseen example and its k nearest neighbours in the training set, obtain an ordering, from low to high, of the influence strength that the unseen example exerts on the local label density of each label in the k nearest neighbours' label sets. The local label density influence strength of label l is calculated as follows: if label l were present in the unseen example's label set, the new local label density of label l would be ρ_l′, and the influence strength of label l on the local label density is computed from the change between ρ_l and ρ_l′;
3.4) calculate the proposed correction weight w_l^den for the spatial distribution of the training examples from these influence strengths, where σ is a correction coefficient for the weight.
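Steps 3.1)-3.3) can be sketched as below. The density follows the stated definition (standard deviation of pairwise distances among same-labelled neighbours); measuring the influence strength as the relative change of ρ_l when the unseen example is hypothetically added with label l is an assumed reading of step 3.3), since the original formula was not preserved.

```python
import numpy as np

def local_label_density(X_nn, Y_nn, q):
    """Steps 3.1)-3.2): per-label local density rho_l, defined as the
    standard deviation of pairwise distances among the neighbourhood
    examples carrying label l. X_nn: (k, d) features, Y_nn: (k, q)."""
    rho = np.zeros(q)
    for l in range(q):
        pts = X_nn[Y_nn[:, l] == 1]
        if len(pts) < 2:
            continue  # density undefined with fewer than two carriers
        d = [np.linalg.norm(a - b)
             for i, a in enumerate(pts) for b in pts[i + 1:]]
        rho[l] = np.std(d)
    return rho

def density_influence(X_nn, Y_nn, x, q):
    """Step 3.3), assumed form: recompute the density as if the unseen
    example x carried label l, and take the relative change of rho_l
    as the influence strength of label l."""
    rho = local_label_density(X_nn, Y_nn, q)
    strength = np.zeros(q)
    for l in range(q):
        X_aug = np.vstack([X_nn, x])
        Y_aug = np.vstack([Y_nn, np.eye(q, dtype=int)[l]])
        rho_new = local_label_density(X_aug, Y_aug, q)[l]
        if rho[l] > 0:
            strength[l] = abs(rho_new - rho[l]) / rho[l]
    return rho, strength
```

Sorting the labels by this strength yields the low-to-high influence ordering of step 3.3); step 3.4) then maps the strengths to weights w_l^den using the correction coefficient σ.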
4) Taking into account the weight factor w^num calculated in step 2), calculate the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj is the event that the unseen example x has class label yj;
5) calculate the likelihoods P(Cj | Hj) and P(Cj | ¬Hj) that, given Hj holds or not, exactly Cj samples in N(x) carry class label yj, where N(x) is the set of k nearest neighbour samples of x in the training set and Cj counts the samples in N(x) whose label sets contain yj;
6) combining the results calculated in steps 4) and 5) and applying Bayes' theorem as in the ML-kNN algorithm, the required multi-label classifier is obtained: the unseen example x is assigned label yj whenever the weighted posterior evidence for Hj exceeds that for ¬Hj, i.e. wj · P(Hj) P(Cj | Hj) > P(¬Hj) P(Cj | ¬Hj), where wj combines the weight factors obtained in steps 2) and 3).
7) before classifying the text to be classified by using the classifier obtained in the step 6), directly skipping the labels already obtained in the pre-classification result obtained in the step 1), and then judging other undetermined labels.
Claims (4)
1. A method for classifying ML-kNN multi-label Chinese texts based on weights is characterized by comprising the following steps:
step 1, pre-classifying the text to be classified;
step 2, calculating the weight factor that corrects for the unbalanced number of labels;
step 3, calculating the correction weight factor proposed for the spatial distribution of the training examples;
step 4, calculating the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj is the event that the unseen example x has class label yj;
step 5, calculating the likelihoods P(Cj | Hj) and P(Cj | ¬Hj) that, given Hj holds or not, exactly Cj samples in N(x) carry class label yj, where N(x) is the set of k nearest neighbour samples of x in the training set and Cj counts the samples in N(x) whose label sets contain yj;
step 6, combining the results calculated in steps 4 and 5 to obtain the required multi-label classifier based on the ML-kNN algorithm;
and step 7, classifying the unclassified texts by combining the pre-classification result of step 1 with the multi-label classifier of step 6.
2. The method of claim 1, wherein the step 1 of pre-classifying the text to be classified comprises the following steps:
1.1, first determining all category names and using them as the original category label set;
1.2, using all text data in the training set together with the latest Chinese Wikipedia corpus as the model corpus, wherein the Chinese Wikipedia corpus is first converted from traditional to simplified Chinese, then segmented into words, after which stop words and low-frequency words are removed, keeping nouns, noun phrases, adjectives, verbs, and other words likely to carry practical meaning;
1.3, expanding the category label set: using the word-vector model word2vec to represent every word in the corpus of step 1.2 as a vector, wherein distance in the vector space reflects semantic similarity between words, and adding to the category label set any corpus word whose similarity to a category in the original category label set exceeds 0.9, so that the expanded category label set has stronger category representation capability;
1.4, since, owing to the characteristics of Chinese text, a text that contains a category name is necessarily related to that category, retrieving with the expanded category label set over all texts to be classified, traversing them and tagging each text with the corresponding category labels.
3. The method of claim 2, wherein the step 2 of calculating the weighting factors corresponding to the problem of unbalanced label number comprises the following steps:
2.1, counting the number of samples whose label sets contain label l, i.e. s_l = Σ_{i=1}^{m} [l ∈ Y_i], where m is the number of training samples, Y_i is the label set of example i, and the indicator [l ∈ Y_i] takes the value 1 when label l is present in Y_i and 0 otherwise;
2.2, calculating the average number of samples per class over the training label set, s̄ = (1/|γ|) Σ_{l=1}^{q} s_l, where γ = {y1, y2, …, yq} is the label space containing q categories and |γ| denotes the number of categories in it, namely q;
2.3, against the classification errors caused by the unbalanced number of labels, defining the weight factor of label l as w_l^num = s̄ / s_l, so that rarer labels receive a larger corrective weight.
4. The method of claim 3, wherein the step 3 of calculating the proposed modified weighting factors for the spatial distribution of training examples comprises the following steps:
3.1, since the standard deviation reflects the overall spatial distribution of examples, taking the standard deviation of the distances between examples carrying the same label within the local space as the local label density of that label, denoted ρ;
3.2, from the spatial distribution of the k nearest neighbour examples around the unseen example, calculating the local label density ρ_l of the examples among those k whose label sets contain label l;
3.3, using the mutual information between the spatial distributions of the unseen example and its k nearest neighbours in the training set, obtaining an ordering, from low to high, of the influence strength that the unseen example exerts on the local label density of each label in the k nearest neighbours' label sets, wherein the local label density influence strength of label l is calculated as follows: if label l were present in the unseen example's label set, the new local label density of label l would be ρ_l′, and the influence strength of label l on the local label density is computed from the change between ρ_l and ρ_l′;
3.4, calculating the proposed correction weight factor for the spatial distribution of the training examples from these influence strengths, where σ is a correction coefficient for the weight.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710724115.9A CN107526805B (en) | 2017-08-22 | 2017-08-22 | ML-kNN multi-tag Chinese text classification method based on weight |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107526805A CN107526805A (en) | 2017-12-29 |
CN107526805B true CN107526805B (en) | 2019-12-24 |
Family
ID=60681840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710724115.9A Active CN107526805B (en) | 2017-08-22 | 2017-08-22 | ML-kNN multi-tag Chinese text classification method based on weight |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107526805B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933579B (en) * | 2019-02-01 | 2022-12-27 | 中山大学 | Local K neighbor missing value interpolation system and method |
CN110059756A (en) * | 2019-04-23 | 2019-07-26 | 东华大学 | A kind of multi-tag categorizing system based on multiple-objection optimization |
CN111930892B (en) * | 2020-08-07 | 2023-09-29 | 重庆邮电大学 | Scientific and technological text classification method based on improved mutual information function |
CN112464973B (en) * | 2020-08-13 | 2024-02-02 | 浙江师范大学 | Multi-label classification method based on average distance weight and value calculation |
CN112241454B (en) * | 2020-12-14 | 2021-02-19 | 成都数联铭品科技有限公司 | Text classification method for processing sample inclination |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7139754B2 (en) * | 2004-02-09 | 2006-11-21 | Xerox Corporation | Method for multi-class, multi-label categorization using probabilistic hierarchical modeling |
CN105095494B (en) * | 2015-08-21 | 2019-03-26 | 中国地质大学(武汉) | The method that a kind of pair of categorized data set is tested |
CN106886569B (en) * | 2017-01-13 | 2020-05-12 | 重庆邮电大学 | ML-KNN multi-tag Chinese text classification method based on MPI |
- 2017-08-22: CN application CN201710724115.9A (patent CN107526805B, en), status: Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107526805B (en) | ML-kNN multi-tag Chinese text classification method based on weight | |
CN111177374B (en) | Question-answer corpus emotion classification method and system based on active learning | |
CN102508859B (en) | Advertisement classification method and device based on webpage characteristic | |
CN103299324B (en) | Potential son is used to mark the mark learnt for video annotation | |
CN106156204B (en) | Text label extraction method and device | |
CN108009249B (en) | Spam comment filtering method for unbalanced data and fusing user behavior rules | |
US8788503B1 (en) | Content identification | |
US8150822B2 (en) | On-line iterative multistage search engine with text categorization and supervised learning | |
CN102073864B (en) | Football item detecting system with four-layer structure in sports video and realization method thereof | |
CN105095494B (en) | The method that a kind of pair of categorized data set is tested | |
CN101561805A (en) | Document classifier generation method and system | |
CN103473380B (en) | A kind of computer version sensibility classification method | |
CN109784368A (en) | A kind of determination method and apparatus of application program classification | |
CN113627151B (en) | Cross-modal data matching method, device, equipment and medium | |
CN110008365B (en) | Image processing method, device and equipment and readable storage medium | |
CN109582783B (en) | Hot topic detection method and device | |
CN113255354B (en) | Search intention recognition method, device, server and storage medium | |
Patel et al. | Dynamic lexicon generation for natural scene images | |
CN106997379A (en) | A kind of merging method of the close text based on picture text click volume | |
CN111353045A (en) | Method for constructing text classification system | |
Kordumova et al. | Pooling objects for recognizing scenes without examples | |
US11983202B2 (en) | Computer-implemented method for improving classification of labels and categories of a database | |
CN112667813A (en) | Method for identifying sensitive identity information of referee document | |
US8498978B2 (en) | Slideshow video file detection | |
CN114817633A (en) | Video classification method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
2020-11-25 | EE01 | Entry into force of recordation of patent licensing contract | Assignee: Hangzhou Yuanchuan New Technology Co.,Ltd.; Assignor: HANGZHOU DIANZI University; Contract record no.: X2020330000104; Denomination of invention: A weight-based ML-kNN multi-label Chinese text classification method; Application publication date: 2017-12-29; Granted publication date: 2019-12-24; License type: Common License