CN107526805B - ML-kNN multi-tag Chinese text classification method based on weight - Google Patents
- Publication number
- CN107526805B (application CN201710724115.9A)
- Authority
- CN
- China
- Prior art keywords
- label
- category
- labels
- training
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
Abstract
The invention discloses a weight-based ML-kNN multi-label Chinese text classification method. The invention mainly aims to solve the problem that, when the ML-kNN algorithm is used to classify multi-label Chinese texts, misjudgment or incomplete judgment of an unseen example's label set is easily caused by an unbalanced number of each class of labels in the training set or by an uneven spatial distribution of the training samples. The technical scheme adopted is to assign corresponding correction weights to all neighbouring labels within a local range according to the proportions of each class of labels in the training set, and then, at the stage of deciding the unseen example's label set, to assign different weights to the labels according to mutual information between the spatial distributions of the unseen example and the training examples. Meanwhile, to improve classification efficiency, the text undergoes a pre-classification step before formal classification, which effectively improves the efficiency of multi-label Chinese text classification.
Description
Technical Field
The invention relates to the field of text classification, in particular to an ML-kNN multi-label Chinese text classification method based on weight.
Background
The multi-label problem is a common phenomenon in the real world. In text classification, a news article may cover several predefined topics, such as "education" and "sports", at the same time; in image classification, a picture may contain multiple scenes such as "field" and "mountain range"; in bioinformatics, a gene may have multiple functions such as "metabolism", "transcription", and "protein synthesis" simultaneously; in audio classification, a piece of music may belong to several emotion categories at once; in video classification, a movie may belong to multiple genres such as "drama" and "romance". This has motivated research on multi-label classification, which aims to learn a multi-label classifier from given training examples and their corresponding class label sets. On this basis, for any example to be classified, the classifier can predict the set of labels to which that example corresponds.
Multi-label classification can be regarded as a learning problem whose task is to build a learner that assigns a given example to its corresponding categories. However, since an example to be classified may be associated with multiple classes simultaneously, this is neither a binary nor a single-label multi-class classification problem; it is a multi-label classification problem.
The conventional multi-label classification algorithm ML-kNN adopts the k-nearest-neighbour classification criterion: it counts the class label information of the neighbouring samples and infers the label set of an unseen example by maximum a posteriori (MAP) estimation. When the numbers of class labels carried by the training examples are unbalanced, or the training examples carrying a given class label are unevenly distributed in space, this can cause misjudgment or incomplete judgment of the unseen example's label set.
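As a point of reference for the weighted variant the invention proposes, the vanilla ML-kNN decision described above can be sketched as follows. This is an illustrative re-implementation, not the patented method: the function name, the Euclidean metric, and the Laplace smoothing constant s follow the original ML-kNN algorithm, not this document.

```python
import numpy as np

def ml_knn_predict(X_train, Y_train, x, k=10, s=1.0):
    """Vanilla ML-kNN MAP decision for one unseen example x.

    X_train: (m, d) feature matrix; Y_train: (m, q) binary label matrix.
    s is the Laplace smoothing constant of the original algorithm.
    """
    m, q = Y_train.shape
    # k nearest neighbours of x by Euclidean distance
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]

    y_pred = np.zeros(q, dtype=int)
    for j in range(q):
        # prior P(H_j): smoothed fraction of training examples with label y_j
        prior_h1 = (s + Y_train[:, j].sum()) / (2 * s + m)
        prior_h0 = 1.0 - prior_h1
        # C_j: neighbours of x whose label set contains y_j
        c_j = int(Y_train[nn, j].sum())
        # estimate P(C_j | H_j) and P(C_j | not H_j) from how many positive
        # neighbours examples with / without label y_j see among their own k
        kappa = np.zeros(k + 1)
        kappa_bar = np.zeros(k + 1)
        for i in range(m):
            # [1:k+1] skips index 0, which is the example itself (distance 0)
            nn_i = np.argsort(np.linalg.norm(X_train - X_train[i], axis=1))[1:k + 1]
            delta = int(Y_train[nn_i, j].sum())
            if Y_train[i, j] == 1:
                kappa[delta] += 1
            else:
                kappa_bar[delta] += 1
        like_h1 = (s + kappa[c_j]) / (s * (k + 1) + kappa.sum())
        like_h0 = (s + kappa_bar[c_j]) / (s * (k + 1) + kappa_bar.sum())
        # MAP decision: keep label y_j iff the posterior evidence for H_j wins
        y_pred[j] = int(prior_h1 * like_h1 > prior_h0 * like_h0)
    return y_pred
```

For each candidate label, the sketch estimates the prior from label frequencies and the likelihood from neighbourhood statistics, then compares the two posteriors; it is exactly this unweighted comparison that the invention corrects with its two weight factors.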
Disclosure of Invention
The invention aims to provide a weight-based ML-kNN multi-label Chinese text classification method that addresses the defect of the prior art, namely that the conventional ML-kNN algorithm ignores the label counts and the spatial distribution of the training examples when classifying multi-label Chinese texts.
A method for classifying ML-kNN multi-label Chinese texts based on weights comprises the following steps:
step 1, pre-classifying the text to be classified;
step 2, calculating the weight factor that corrects for the unbalanced number of labels;
step 3, calculating the correction weight factor proposed for the spatial distribution of the training examples;
step 4, using the weight factors, calculating the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj is the event that the unseen example x has class label yj;
step 5, calculating the likelihoods P(Cj | Hj) and P(Cj | ¬Hj) that, given Hj holds or not, exactly Cj samples in N(x) carry class label yj, where N(x) is the set of k nearest neighbour samples of x in the training set and Cj counts the samples in N(x) whose label sets contain yj;
step 6, combining the results calculated in steps 4 and 5 to obtain the required multi-label classifier based on the ML-kNN algorithm;
and step 7, classifying the unclassified texts by combining the pre-classification result of step 1 with the classifier of step 6.
The pre-classification of the text to be classified in step 1 proceeds as follows:
1.1, first determine all category names and use them as the original category label set;
1.2, use all text data in the training set together with the latest Chinese Wikipedia corpus as the model corpus, wherein the Chinese Wikipedia corpus is first converted from traditional to simplified Chinese, then segmented into words, after which stop words and low-frequency words are removed, keeping nouns, noun phrases, adjectives, verbs, and other words likely to carry practical meaning;
1.3, expand the category label set: use the word-vector model word2vec to represent every word in the corpus of step 1.2 as a vector, so that distance in the vector space reflects semantic similarity between words; any corpus word whose similarity to a category in the original category label set exceeds 0.9 is added to the category label set, expanding it so that the enlarged set has stronger category representation capability;
1.4, owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore the expanded category label set is used to retrieve over all texts to be classified, traversing them and tagging each text with the corresponding category labels.
The calculation in step 2 of the weight factor that corrects for the unbalanced number of labels proceeds as follows:
2.1, count the number of samples whose label sets contain label l, i.e. s_l = Σ_{i=1}^{m} [l ∈ Y_i], where m is the number of training samples, Y_i is the label set of example i, and the indicator [l ∈ Y_i] takes the value 1 when label l is present in Y_i and 0 otherwise;
2.2, calculate the average number of samples per class over the training label set, s̄ = (1/|γ|) Σ_{l=1}^{q} s_l, where γ = {y1, y2, …, yq} is the label space containing q categories and |γ| denotes the number of categories in it, namely q;
2.3, against the classification errors caused by the unbalanced number of labels, define the weight factor of label l as w_l^num = s̄ / s_l, so that rarer labels receive a larger corrective weight.
The calculation in step 3 of the correction weight factor proposed for the spatial distribution of the training examples proceeds as follows:
3.1, since the standard deviation reflects the overall spatial distribution of examples, the standard deviation of the distances between examples carrying the same label within the local space is used as the local label density of that label, denoted ρ;
3.2, from the spatial distribution of the k nearest neighbour examples around the unseen example, calculate the local label density ρ_l of the examples among those k whose label sets contain label l;
3.3, using the mutual information between the spatial distributions of the unseen example and its k nearest neighbours in the training set, obtain an ordering, from low to high, of the influence strength that the unseen example exerts on the local label density of each label in the k nearest neighbours' label sets; the local label density influence strength of label l is calculated as follows: if label l were present in the unseen example's label set, the new local label density of label l would be ρ_l′, and the influence strength of label l on the local label density is computed from the change between ρ_l and ρ_l′;
3.4, calculate the proposed correction weight w_l^den for the spatial distribution of the training examples from these influence strengths, where σ is a correction coefficient for the weight.
The invention has the following advantages and beneficial effects:
For the multi-label classification problem, the invention overcomes the defects of the ML-kNN algorithm in multi-label Chinese text classification by taking into account both the number of labels of each class and the spatial distribution of the training examples, thereby improving the multi-label classification effect. Meanwhile, before formal classification, the texts are pre-classified using a constructed and expanded category label set, which greatly improves the efficiency of multi-label Chinese text classification.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, a method for classifying ML-kNN multi-tag chinese texts based on weights includes the following steps:
1) The text to be classified is pre-classified as follows:
1.1) first determine all category names and use them as the original category label set;
1.2) use all text data in the training set together with the latest Chinese Wikipedia corpus as the model corpus, wherein the Chinese Wikipedia corpus is first converted from traditional to simplified Chinese, then segmented into words, after which stop words (adverbs, prepositions, and the like) and low-frequency words are removed, keeping nouns, noun phrases, adjectives, verbs, and other words likely to carry practical meaning;
1.3) expand the category label set: use the word-vector model word2vec to represent every word in the corpus of step 1.2) as a vector, so that distance in the vector space reflects semantic similarity between words; the invention therefore uses word2vec to add to the category label set any corpus word whose similarity to a category in the original category label set exceeds 0.9, expanding the set so that the enlarged category labels have stronger category representation capability;
1.4) owing to the characteristics of Chinese text, when a category name appears in a text, the text must be related to that category. The expanded category label set is retrieved, all texts to be classified are traversed, and each text is tagged with its corresponding category labels.
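Steps 1.3) and 1.4) can be sketched as follows. This is a hypothetical implementation: the word vectors are assumed to be available as a plain dict (for instance exported from a word2vec model trained on the segmented corpus), the function names are illustrative, and the 0.9 similarity threshold is the one stated in step 1.3).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_label_set(original_labels, word_vectors, threshold=0.9):
    """Step 1.3): any corpus word whose cosine similarity to a category
    name exceeds the threshold joins that category's name set.

    word_vectors: dict mapping word -> np.ndarray embedding.
    Returns a dict mapping each category to its expanded name set.
    """
    expanded = {}
    for label in original_labels:
        names = {label}
        if label in word_vectors:
            lv = word_vectors[label]
            for word, wv in word_vectors.items():
                if word != label and cosine(lv, wv) > threshold:
                    names.add(word)
        expanded[label] = names
    return expanded

def pre_classify(tokens, expanded):
    """Step 1.4): a segmented text containing any (expanded) category
    name is immediately tagged with that category."""
    toks = set(tokens)
    return {label for label, names in expanded.items() if names & toks}
```

The pre-classification result is then carried into step 7), where the labels it fixes are skipped during formal classification.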
2) Calculate the weight factor that corrects for the unbalanced number of labels, as follows:
2.1) count the number of samples whose label sets contain label l, i.e. s_l = Σ_{i=1}^{m} [l ∈ Y_i], where m is the number of training samples, Y_i is the label set of example i, and the indicator [l ∈ Y_i] takes the value 1 when the label set of example i contains label l and 0 otherwise;
2.2) calculate the average number of samples per class over the training label set, s̄ = (1/|γ|) Σ_{l=1}^{q} s_l, where γ = {y1, y2, …, yq} is the label space containing q categories and |γ| denotes the number of categories in it, namely q;
2.3) against the classification errors caused by the unbalanced number of labels, define the weight factor of label l as w_l^num = s̄ / s_l. Since the weight factors of different labels may differ considerably, applying them globally would influence the classification effect too strongly, so the weighting is applied only to the local k nearest neighbours of the unseen example.
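A minimal sketch of steps 2.1)-2.3). The inverse-frequency form of the weight, s̄ / s_l, is an assumption here: the original formula image was not preserved, and the text only states that the factor corrects for unbalanced label counts, which an inverse-frequency weight does by boosting rarer labels.

```python
import numpy as np

def label_count_weights(Y):
    """Steps 2.1)-2.3), hedged: Y is the (m, q) binary label matrix.

    s_l counts the training samples whose label set contains label l;
    s_bar is the average count per class; the (assumed) weight s_bar/s_l
    gives rarer labels a larger corrective weight, to be applied only
    within the local k-nearest-neighbour range as step 2.3) describes.
    """
    s = Y.sum(axis=0).astype(float)      # s_l for l = 1..q  (step 2.1)
    s_bar = s.mean()                     # average per-class count (step 2.2)
    return s_bar / np.maximum(s, 1.0)    # w_l^num, guarded against s_l = 0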
3) Calculate the correction weight factor proposed for the spatial distribution of the training examples, as follows:
3.1) since the standard deviation reflects the overall spatial distribution of examples, the standard deviation of the distances between examples carrying the same label within the local space is used as the local label density of that label, denoted ρ;
3.2) from the spatial distribution of the k nearest neighbour examples around the unseen example, calculate the local label density ρ_l of the examples among those k whose label sets contain label l;
3.3) using the mutual information between the spatial distributions of the unseen example and its k nearest neighbours in the training set, obtain an ordering, from low to high, of the influence strength that the unseen example exerts on the local label density of each label in the k nearest neighbours' label sets. The local label density influence strength of label l is calculated as follows: if label l were present in the unseen example's label set, the new local label density of label l would be ρ_l′, and the influence strength of label l on the local label density is computed from the change between ρ_l and ρ_l′;
3.4) calculate the proposed correction weight w_l^den for the spatial distribution of the training examples from these influence strengths, where σ is a correction coefficient for the weight.
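Steps 3.1)-3.3) can be sketched as below. The density follows the stated definition (standard deviation of pairwise distances among same-labelled neighbours); measuring the influence strength as the relative change of ρ_l when the unseen example is hypothetically added with label l is an assumed reading of step 3.3), since the original formula was not preserved.

```python
import numpy as np

def local_label_density(X_nn, Y_nn, q):
    """Steps 3.1)-3.2): per-label local density rho_l, defined as the
    standard deviation of pairwise distances among the neighbourhood
    examples carrying label l. X_nn: (k, d) features, Y_nn: (k, q)."""
    rho = np.zeros(q)
    for l in range(q):
        pts = X_nn[Y_nn[:, l] == 1]
        if len(pts) < 2:
            continue  # density undefined with fewer than two carriers
        d = [np.linalg.norm(a - b)
             for i, a in enumerate(pts) for b in pts[i + 1:]]
        rho[l] = np.std(d)
    return rho

def density_influence(X_nn, Y_nn, x, q):
    """Step 3.3), assumed form: recompute the density as if the unseen
    example x carried label l, and take the relative change of rho_l
    as the influence strength of label l."""
    rho = local_label_density(X_nn, Y_nn, q)
    strength = np.zeros(q)
    for l in range(q):
        X_aug = np.vstack([X_nn, x])
        Y_aug = np.vstack([Y_nn, np.eye(q, dtype=int)[l]])
        rho_new = local_label_density(X_aug, Y_aug, q)[l]
        if rho[l] > 0:
            strength[l] = abs(rho_new - rho[l]) / rho[l]
    return rho, strength
```

Sorting the labels by this strength yields the low-to-high influence ordering of step 3.3); step 3.4) then maps the strengths to weights w_l^den using the correction coefficient σ.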
4) Taking into account the weight factor w^num calculated in step 2), calculate the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj is the event that the unseen example x has class label yj;
5) calculate the likelihoods P(Cj | Hj) and P(Cj | ¬Hj) that, given Hj holds or not, exactly Cj samples in N(x) carry class label yj, where N(x) is the set of k nearest neighbour samples of x in the training set and Cj counts the samples in N(x) whose label sets contain yj;
6) combining the results calculated in steps 4) and 5) and applying Bayes' theorem as in the ML-kNN algorithm, the required multi-label classifier is obtained: the unseen example x is assigned label yj whenever the weighted posterior evidence for Hj exceeds that for ¬Hj, i.e. wj · P(Hj) P(Cj | Hj) > P(¬Hj) P(Cj | ¬Hj), where wj combines the weight factors obtained in steps 2) and 3).
7) before classifying the text to be classified by using the classifier obtained in the step 6), directly skipping the labels already obtained in the pre-classification result obtained in the step 1), and then judging other undetermined labels.
Claims (4)
1. A method for classifying ML-kNN multi-label Chinese texts based on weights is characterized by comprising the following steps:
step 1, pre-classifying the text to be classified;
step 2, calculating the weight factor that corrects for the unbalanced number of labels;
step 3, calculating the correction weight factor proposed for the spatial distribution of the training examples;
step 4, calculating the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj is the event that the unseen example x has class label yj;
step 5, calculating the likelihoods P(Cj | Hj) and P(Cj | ¬Hj) that, given Hj holds or not, exactly Cj samples in N(x) carry class label yj, where N(x) is the set of k nearest neighbour samples of x in the training set and Cj counts the samples in N(x) whose label sets contain yj;
step 6, combining the results calculated in steps 4 and 5 to obtain the required multi-label classifier based on the ML-kNN algorithm;
and step 7, classifying the unclassified texts by combining the pre-classification result of step 1 with the multi-label classifier of step 6.
2. The method of claim 1, wherein the step 1 of pre-classifying the text to be classified comprises the following steps:
1.1, first determining all category names and using them as the original category label set;
1.2, using all text data in the training set together with the latest Chinese Wikipedia corpus as the model corpus, wherein the Chinese Wikipedia corpus is first converted from traditional to simplified Chinese, then segmented into words, after which stop words and low-frequency words are removed, keeping nouns, noun phrases, adjectives, verbs, and other words likely to carry practical meaning;
1.3, expanding the category label set: using the word-vector model word2vec to represent every word in the corpus of step 1.2 as a vector, wherein distance in the vector space reflects semantic similarity between words, and adding to the category label set any corpus word whose similarity to a category in the original category label set exceeds 0.9, so that the expanded category label set has stronger category representation capability;
1.4, since, owing to the characteristics of Chinese text, a text that contains a category name is necessarily related to that category, retrieving with the expanded category label set over all texts to be classified, traversing them and tagging each text with the corresponding category labels.
3. The method of claim 2, wherein the step 2 of calculating the weighting factors corresponding to the problem of unbalanced label number comprises the following steps:
2.1, counting the number of samples whose label sets contain label l, i.e. s_l = Σ_{i=1}^{m} [l ∈ Y_i], where m is the number of training samples, Y_i is the label set of example i, and the indicator [l ∈ Y_i] takes the value 1 when label l is present in Y_i and 0 otherwise;
2.2, calculating the average number of samples per class over the training label set, s̄ = (1/|γ|) Σ_{l=1}^{q} s_l, where γ = {y1, y2, …, yq} is the label space containing q categories and |γ| denotes the number of categories in it, namely q;
2.3, against the classification errors caused by the unbalanced number of labels, defining the weight factor of label l as w_l^num = s̄ / s_l, so that rarer labels receive a larger corrective weight.
4. The method of claim 3, wherein the step 3 of calculating the proposed modified weighting factors for the spatial distribution of training examples comprises the following steps:
3.1, since the standard deviation reflects the overall spatial distribution of examples, taking the standard deviation of the distances between examples carrying the same label within the local space as the local label density of that label, denoted ρ;
3.2, from the spatial distribution of the k nearest neighbour examples around the unseen example, calculating the local label density ρ_l of the examples among those k whose label sets contain label l;
3.3, using the mutual information between the spatial distributions of the unseen example and its k nearest neighbours in the training set, obtaining an ordering, from low to high, of the influence strength that the unseen example exerts on the local label density of each label in the k nearest neighbours' label sets, wherein the local label density influence strength of label l is calculated as follows: if label l were present in the unseen example's label set, the new local label density of label l would be ρ_l′, and the influence strength of label l on the local label density is computed from the change between ρ_l and ρ_l′;
3.4, calculating the proposed correction weight factor for the spatial distribution of the training examples from these influence strengths, where σ is a correction coefficient for the weight.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710724115.9A CN107526805B (en) | 2017-08-22 | 2017-08-22 | ML-kNN multi-tag Chinese text classification method based on weight |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107526805A CN107526805A (en) | 2017-12-29 |
CN107526805B true CN107526805B (en) | 2019-12-24 |
Family
ID=60681840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710724115.9A Active CN107526805B (en) | 2017-08-22 | 2017-08-22 | ML-kNN multi-tag Chinese text classification method based on weight |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107526805B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933579B (en) * | 2019-02-01 | 2022-12-27 | 中山大学 | Local K neighbor missing value interpolation system and method |
CN110059756A (en) * | 2019-04-23 | 2019-07-26 | 东华大学 | A kind of multi-tag categorizing system based on multiple-objection optimization |
CN111930892B (en) * | 2020-08-07 | 2023-09-29 | 重庆邮电大学 | Scientific and technological text classification method based on improved mutual information function |
CN112464973B (en) * | 2020-08-13 | 2024-02-02 | 浙江师范大学 | Multi-label classification method based on average distance weight and value calculation |
CN112241454B (en) * | 2020-12-14 | 2021-02-19 | 成都数联铭品科技有限公司 | Text classification method for processing sample inclination |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7139754B2 (en) * | 2004-02-09 | 2006-11-21 | Xerox Corporation | Method for multi-class, multi-label categorization using probabilistic hierarchical modeling |
CN105095494B (en) * | 2015-08-21 | 2019-03-26 | 中国地质大学(武汉) | The method that a kind of pair of categorized data set is tested |
CN106886569B (en) * | 2017-01-13 | 2020-05-12 | 重庆邮电大学 | ML-KNN multi-tag Chinese text classification method based on MPI |
- 2017-08-22: CN application CN201710724115.9A (patent CN107526805B, en), status: Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107526805B (en) | ML-kNN multi-tag Chinese text classification method based on weight | |
CN111177374B (en) | Question-answer corpus emotion classification method and system based on active learning | |
CN102508859B (en) | Advertisement classification method and device based on webpage characteristic | |
CN103299324B (en) | Potential son is used to mark the mark learnt for video annotation | |
CN106156204B (en) | Text label extraction method and device | |
CN108009249B (en) | Spam comment filtering method for unbalanced data and fusing user behavior rules | |
US8788503B1 (en) | Content identification | |
US8150822B2 (en) | On-line iterative multistage search engine with text categorization and supervised learning | |
CN102073864B (en) | Football item detecting system with four-layer structure in sports video and realization method thereof | |
CN105095494B (en) | The method that a kind of pair of categorized data set is tested | |
CN101561805A (en) | Document classifier generation method and system | |
CN103473380B (en) | A kind of computer version sensibility classification method | |
CN109784368A (en) | A kind of determination method and apparatus of application program classification | |
CN113627151B (en) | Cross-modal data matching method, device, equipment and medium | |
CN110008365B (en) | Image processing method, device and equipment and readable storage medium | |
CN109582783B (en) | Hot topic detection method and device | |
CN113255354B (en) | Search intention recognition method, device, server and storage medium | |
Patel et al. | Dynamic lexicon generation for natural scene images | |
CN106997379A (en) | A kind of merging method of the close text based on picture text click volume | |
CN111353045A (en) | Method for constructing text classification system | |
Kordumova et al. | Pooling objects for recognizing scenes without examples | |
US11983202B2 (en) | Computer-implemented method for improving classification of labels and categories of a database | |
CN112667813A (en) | Method for identifying sensitive identity information of referee document | |
US8498978B2 (en) | Slideshow video file detection | |
CN114817633A (en) | Video classification method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
2020-11-25 | EE01 | Entry into force of recordation of patent licensing contract | Assignee: Hangzhou Yuanchuan New Technology Co.,Ltd.; Assignor: HANGZHOU DIANZI University; Contract record no.: X2020330000104; Denomination of invention: A weight-based ML-kNN multi-label Chinese text classification method; Application publication date: 2017-12-29; Granted publication date: 2019-12-24; License type: Common License