CN107526805B - ML-kNN multi-tag Chinese text classification method based on weight - Google Patents


Info

Publication number
CN107526805B
Authority
CN
China
Prior art keywords
label
category
labels
training
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710724115.9A
Other languages
Chinese (zh)
Other versions
CN107526805A (en)
Inventor
姜明
张旻
杜炼
汤景凡
程柳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN201710724115.9A
Publication of CN107526805A
Application granted
Publication of CN107526805B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes


Abstract

The invention discloses a weight-based ML-kNN multi-label Chinese text classification method. It mainly aims to solve the problem that, when the ML-kNN algorithm is used to classify multi-label Chinese texts, an unbalanced number of labels of each class in the training set, or an uneven spatial distribution of the training samples, easily causes misjudgment or incomplete judgment of an unseen example's label set. The technical scheme is to assign each neighboring label a corresponding correction weight within a local range according to the proportion of each class of labels in the training set, and then, when deciding the label set of an unseen example, to assign the labels further weights according to the mutual information of the spatial distribution between the unseen example and the training examples. In addition, to improve classification efficiency, the texts undergo a pre-classification treatment before formal classification, which effectively improves the efficiency of multi-label Chinese text classification.

Description

ML-kNN multi-tag Chinese text classification method based on weight
Technical Field
The invention relates to the field of text classification, in particular to an ML-kNN multi-label Chinese text classification method based on weight.
Background
The multi-label problem is a common phenomenon in the real world. In text classification, a news article may simultaneously cover several predefined topics such as "education" and "sports"; in image classification, a picture may contain multiple scenes such as "field" and "mountain range"; in bioinformatics, a gene may simultaneously have multiple functions such as "metabolism", "transcription" and "protein synthesis"; in audio classification, a piece of music may belong to several categories such as "cheerful" and "happy"; in video classification, a movie may belong to multiple genres such as "drama" and "romance". This has motivated research on multi-label classification, which aims to learn a multi-label classifier from given training examples and their corresponding class label sets. For any example to be classified, the classifier can then predict the label set to which the example corresponds.
Multi-label classification can be regarded as a learning problem whose task is to build a learner that assigns a given example to be classified to its corresponding categories. Because an example to be classified may be associated with multiple classes simultaneously, the problem is neither binary nor single-label multi-class classification, but multi-label classification.
The conventional multi-label classification algorithm ML-kNN adopts the k-nearest-neighbor classification criterion: it counts the class label information of the neighboring samples and infers the label set of an unseen example by maximum a posteriori (MAP) estimation. When the numbers of class labels contained in the training examples are unbalanced, or the training examples containing a class label are unevenly distributed in space, this can cause misjudgment or incomplete judgment of the unseen example's label set.
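To make the baseline concrete, the conventional ML-kNN procedure described above can be sketched as follows. This is a minimal illustration of the standard algorithm, not the patent's weighted variant; the Euclidean distance, the Laplace smoothing constant s = 1, and all function names are assumptions.

```python
import numpy as np

def ml_knn_train(X, Y, k, s=1.0):
    """Estimate ML-kNN priors and posteriors from training data.
    X: (m, d) feature matrix; Y: (m, q) binary label matrix."""
    m, q = Y.shape
    # Prior P(H_j): probability that a sample carries label j (Laplace-smoothed)
    prior = (s + Y.sum(axis=0)) / (s * 2 + m)
    # c[j][r]: samples WITH label j whose k neighbors contain label j exactly r times
    # cp[j][r]: same count for samples WITHOUT label j
    c = np.zeros((q, k + 1))
    cp = np.zeros((q, k + 1))
    for i in range(m):
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest neighbors, excluding self
        counts = Y[nn].sum(axis=0)           # per-label counts among the neighbors
        for j in range(q):
            if Y[i, j] == 1:
                c[j, int(counts[j])] += 1
            else:
                cp[j, int(counts[j])] += 1
    # Posteriors P(C_j = r | H_j) and P(C_j = r | not H_j), smoothed
    post1 = (s + c) / (s * (k + 1) + c.sum(axis=1, keepdims=True))
    post0 = (s + cp) / (s * (k + 1) + cp.sum(axis=1, keepdims=True))
    return prior, post1, post0

def ml_knn_predict(x, X, Y, k, prior, post1, post0):
    """MAP decision over each label for an unseen example x."""
    d = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(d)[:k]
    counts = Y[nn].sum(axis=0).astype(int)
    q = Y.shape[1]
    return np.array([
        1 if prior[j] * post1[j, counts[j]] >
             (1 - prior[j]) * post0[j, counts[j]] else 0
        for j in range(q)
    ])
```

On a toy two-cluster dataset this reproduces the expected behavior: points near one cluster receive that cluster's label set.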
Disclosure of Invention
The invention aims to provide a weight-based ML-kNN multi-label Chinese text classification method that addresses the defects of the prior art, namely that the conventional ML-kNN algorithm ignores the label counts and the distribution of the training examples when classifying multi-label Chinese texts.
A method for classifying ML-kNN multi-label Chinese texts based on weights comprises the following steps:
step 1, performing pre-classification processing on the texts to be classified;
step 2, calculating the weight factor that corrects for the unbalanced number of labels;
step 3, calculating the correction weight factor proposed for the spatial distribution of the training examples;
step 4, based on the weight factors, calculating the prior probabilities that event H_j holds and does not hold, recorded as P(H_j) and P(¬H_j) respectively, where H_j denotes the event that unseen example x has class label y_j;
step 5, calculating the posterior probabilities that, given that H_j holds or does not hold, exactly C_j samples in N(x) have class label y_j, where N(x) denotes the set of k nearest-neighbor samples of x in the training set and C_j counts the samples in N(x) whose associated label sets contain y_j;
step 6, combining the results calculated in steps 4 and 5 to obtain the required multi-label classifier based on the ML-kNN algorithm;
and step 7, combining the pre-classification result of step 1 with the classifier of step 6 to classify the unclassified texts.
The text to be classified in the step 1 is pre-classified, and the process is as follows:
1.1, firstly determining all category names, and using all category names as an original category label set;
1.2 use all text data in the training set together with the latest Chinese Wikipedia corpus as the model corpus; the Chinese Wikipedia corpus must first be converted from traditional to simplified Chinese, then segmented into words, after which stop words and low-frequency words are removed, keeping nouns, noun phrases, adjectives, verbs and other words likely to carry real meaning;
1.3 expanding the category label set: all words in the corpus of step 1.2 are represented as vectors with the word-embedding model word2vec; because distance in the vector space reflects the semantic similarity of words, every corpus word whose similarity to a category in the original category label set exceeds 0.9 is added to the category label set, so that the expanded category labels have stronger category-representation capability;
1.4 owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore the expanded category label set is used to traverse all texts to be classified and tag each text with the corresponding category labels.
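The tagging in step 1.4 amounts to a simple substring scan of each text against the expanded category names. A minimal sketch (the function name and example data are illustrative, not from the patent):

```python
def pre_classify(texts, expanded_labels):
    """Tag each text with every category whose expanded name appears in it.
    expanded_labels: dict mapping a category to its set of trigger words,
    e.g. {"sports": {"sports", "football"}} (toy data for illustration)."""
    result = []
    for text in texts:
        tags = {cat for cat, words in expanded_labels.items()
                if any(w in text for w in words)}
        result.append(tags)
    return result
```

Labels found this way are fixed before the ML-kNN classifier runs, so only the remaining labels need a probabilistic decision.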
The calculation of the weighting factor corresponding to the problem of unbalanced label number in the step 2 comprises the following steps:
2.1 count the number num_l of samples whose label sets contain label l, i.e. the sum over the m training samples of an indicator that takes value 1 when label l is present in example i's label set and 0 otherwise, where m is the number of training samples;
2.2 calculate the average per-class label count of the training sample label set, where γ = {y_1, y_2, …, y_q} denotes a label space containing q categories and |γ| denotes the number of categories in the label space, namely q;
2.3 to address the classification errors caused by the unbalanced number of labels, define the correction weight factor of label l from its count num_l relative to this average.
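The patent's exact weight formula appears only as an image in the source. One plausible instantiation consistent with steps 2.1–2.3 — taking the weight of label l as the average label count divided by num_l, so that under-represented labels are boosted — is sketched below (an assumption, not the patent's verbatim formula):

```python
import numpy as np

def label_weights(Y):
    """Per-label correction weights for label-count imbalance.
    Y: (m, q) binary label matrix. num_l is the count of samples carrying
    label l; the weight avg(num)/num_l used here is an assumed form, since
    the patent gives the formula only as an image."""
    num = Y.sum(axis=0)                 # num_l for each of the q labels
    avg = num.sum() / Y.shape[1]        # average label count over the label space
    return avg / np.maximum(num, 1)     # rarer labels receive larger weights
```

With this form, a label carried by half as many samples as the average receives twice the weight.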
The correction weight factor proposed in step 3 for the spatial distribution of the training examples is calculated as follows:
3.1 since the standard deviation can reflect the overall spatial distribution of the examples, the standard deviation of the distances between the examples in the local space whose label sets contain the same label is used as the local label density of that label, denoted ρ;
3.2 from the spatial distribution of the k nearest-neighbor examples of the unseen example, calculate the local label density ρ_l of the examples among those k whose label sets contain label l;
3.3 using the mutual information of the spatial distribution between the unseen example and its k nearest neighbors in the training set, obtain an ordering, from low to high, of the unseen example's influence strength on the local label density of each label in the k-nearest-neighbor label sets; the influence strength for label l is computed as follows: when label l is present in the unseen example's label set, the local label density of label l becomes a new value ρ_l′, and the influence strength of label l on the local label density is derived from the change from ρ_l to ρ_l′;
3.4 calculate the local label density influence weight proposed for the spatial distribution of the training examples, where σ is a correction coefficient for the weight.
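Steps 3.1–3.3 can be sketched as follows. The local label density ρ_l follows the prose definition (standard deviation of pairwise distances among the neighbors sharing label l); the influence-strength formula is only an image in the source, so the relative change |ρ_l′ − ρ_l| / ρ_l used here is an assumption:

```python
from itertools import combinations
import numpy as np

def local_label_density(neighbors, label_sets, l):
    """rho_l: standard deviation of pairwise distances among the k nearest
    neighbors whose label sets contain label l (steps 3.1-3.2).
    neighbors: (k, d) array of points; label_sets: list of k label sets."""
    pts = np.array([p for p, ls in zip(neighbors, label_sets) if l in ls])
    if len(pts) < 2:
        return 0.0
    dists = [np.linalg.norm(a - b) for a, b in combinations(pts, 2)]
    return float(np.std(dists))

def density_influence(rho_l, rho_l_new):
    """Influence strength of adding the unseen example to label l's group
    (step 3.3). The relative density change is one plausible reading of
    the missing formula, stated here as an assumption."""
    if rho_l == 0:
        return 0.0
    return abs(rho_l_new - rho_l) / rho_l
```

Sorting the labels by this influence strength, from low to high, gives the ordering described in step 3.3, which step 3.4 then converts into a weight via the coefficient σ.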
The invention has the following advantages and beneficial effects:
For the multi-label classification problem, the invention removes the shortcomings of the ML-kNN algorithm in multi-label Chinese text classification by taking into account both the number distribution of each class of labels and the spatial distribution of the training examples, which improves the multi-label classification effect. In addition, pre-classifying the texts before formal classification by constructing and expanding the category label set can greatly improve the efficiency of multi-label Chinese text classification.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to FIG. 1, a weight-based ML-kNN multi-label Chinese text classification method includes the following steps:
1) the text to be classified is pre-classified, and the processing process is as follows:
1.1) firstly determining all category names, and using all category names as an original category label set;
1.2) use all text data in the training set together with the latest Chinese Wikipedia corpus as the model corpus; the Chinese Wikipedia corpus must first be converted from traditional to simplified Chinese, then segmented into words, after which stop words (adverbs, prepositions and the like) and low-frequency words are removed, keeping nouns, noun phrases, adjectives, verbs and other words likely to carry real meaning;
1.3) expanding the category label set: all words in the corpus of step 1.2) are represented as vectors with the word-embedding model word2vec; because distance in the vector space reflects the semantic similarity of words, the invention uses word2vec to add every corpus word whose similarity to a category in the original category label set exceeds 0.9 to the category label set, so that the expanded category labels have stronger category-representation capability;
1.4) owing to the characteristics of Chinese text, when a category name appears in a text, the text must be related to that category. The expanded category label set is used to traverse all texts to be classified and tag each text with the corresponding category labels.
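The expansion in step 1.3) is a similarity threshold over word vectors. In practice the vectors would come from a word2vec model trained on the segmented corpus; in the sketch below a toy embedding dict stands in for that model, and all names and data are illustrative:

```python
import numpy as np

def expand_label_set(labels, embeddings, vocab, threshold=0.9):
    """Add to the label set every corpus word whose cosine similarity with an
    existing category label exceeds the threshold (step 1.3).
    embeddings: dict word -> vector (toy stand-in for a trained word2vec
    model); vocab: candidate corpus words."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    expanded = set(labels)
    for w in vocab:
        if w in expanded or w not in embeddings:
            continue
        if any(lbl in embeddings and
               cos(embeddings[w], embeddings[lbl]) > threshold
               for lbl in labels):
            expanded.add(w)
    return expanded
```

The 0.9 threshold matches the patent; everything else (function name, toy vectors) is an assumption for illustration.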
2) Calculating a weight factor corresponding to the problem of unbalanced label quantity, wherein the process is as follows:
2.1) count the number num_l of samples whose label sets contain label l, i.e. the sum over the m training samples of an indicator that takes value 1 when label l is present in example i's label set (and 0 otherwise), where m is the number of training samples;
2.2) calculate the average per-class label count of the training sample label set, where γ = {y_1, y_2, …, y_q} denotes a label space containing q categories and |γ| denotes the number of categories in the label space, namely q;
2.3) to address the classification errors caused by the unbalanced number of labels, define the correction weight factor of label l from its count num_l relative to this average. Since the weight factors of the various labels may differ considerably, applying them globally would strongly disturb the classification result; the weighting is therefore applied only within the local k nearest neighbors of the unseen example.
3) The proposed corrective weight factors for the spatial distribution of the training examples are calculated as follows:
3.1) since the standard deviation can reflect the overall spatial distribution of the examples, the standard deviation of the distances between the examples in the local space whose label sets contain the same label is used as the local label density of that label, denoted ρ;
3.2) from the spatial distribution of the k nearest-neighbor examples of the unseen example, calculate the local label density ρ_l of the examples among those k whose label sets contain label l;
3.3) using the mutual information of the spatial distribution between the unseen example and its k nearest neighbors in the training set, obtain an ordering, from low to high, of the unseen example's influence strength on the local label density of each label in the k-nearest-neighbor label sets. The influence strength for label l is computed as follows: when label l is present in the unseen example's label set, the local label density of label l becomes a new value ρ_l′, and the influence strength of label l on the local label density is derived from the change from ρ_l to ρ_l′;
3.4) calculate the local label density influence weight proposed for the spatial distribution of the training examples, where σ is a correction coefficient for the weight.
4) Taking into account the weight factor w_num calculated in step 2), calculate the prior probabilities that event H_j holds and does not hold, recorded as P(H_j) and P(¬H_j) respectively, where H_j denotes the event that unseen example x has class label y_j;
5) calculate the posterior probabilities that, given that H_j holds or does not hold, exactly C_j samples in N(x) have class label y_j, where N(x) denotes the set of k nearest-neighbor samples of x in the training set and C_j counts the samples in N(x) whose associated label sets contain y_j;
6) combining the results calculated in steps 4) and 5) and applying Bayes' theorem as in the ML-kNN algorithm yields the required multi-label classifier.
7) When classifying a text to be classified with the classifier obtained in step 6), the labels already assigned in the pre-classification result of step 1) are kept directly, and only the remaining undetermined labels are judged.
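The final per-label decision of steps 4)–6) is the standard ML-kNN MAP comparison with the two correction weights applied to the positive hypothesis. How exactly the weights enter the comparison is shown only as an image in the source, so the multiplicative combination below is an assumption:

```python
def weighted_decision(prior_j, post1, post0, w_num, w_density):
    """Per-label MAP decision with both correction weights boosting the
    positive hypothesis (one plausible combination; the patent's exact
    formula is given only as an image).
    prior_j: P(H_j); post1: P(C_j | H_j); post0: P(C_j | not H_j);
    w_num: label-imbalance weight; w_density: local-density weight."""
    score_yes = w_num * w_density * prior_j * post1
    score_no = (1 - prior_j) * post0
    return 1 if score_yes > score_no else 0
```

With both weights at 1.0 this reduces to the conventional ML-kNN decision; a weight above 1.0 for a rare or locally dense label can flip a borderline comparison toward assigning that label.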

Claims (4)

1. A method for classifying ML-kNN multi-label Chinese texts based on weights is characterized by comprising the following steps:
step 1, performing pre-classification processing on the texts to be classified;
step 2, calculating the weight factor that corrects for the unbalanced number of labels;
step 3, calculating the correction weight factor proposed for the spatial distribution of the training examples;
step 4, calculating the prior probabilities that event H_j holds and does not hold, recorded as P(H_j) and P(¬H_j) respectively, where H_j denotes the event that unseen example x has class label y_j;
step 5, calculating the posterior probabilities that, given that H_j holds or does not hold, exactly C_j samples in N(x) have class label y_j, where N(x) denotes the set of k nearest-neighbor samples of x in the training set and C_j counts the samples in N(x) whose associated label sets contain y_j;
step 6, combining the results calculated in steps 4 and 5 to obtain the required multi-label classifier based on the ML-kNN algorithm;
and step 7, combining the pre-classification result of step 1 with the multi-label classifier of step 6 to classify the unclassified texts.
2. The method of claim 1, wherein the step 1 of pre-classifying the text to be classified comprises the following steps:
1.1, firstly determining all category names, and using all category names as an original category label set;
1.2 use all text data in the training set together with the latest Chinese Wikipedia corpus as the model corpus; the Chinese Wikipedia corpus must first be converted from traditional to simplified Chinese, then segmented into words, after which stop words and low-frequency words are removed, keeping nouns, noun phrases, adjectives, verbs and other words likely to carry real meaning;
1.3 expanding the category label set: all words in the corpus of step 1.2 are represented as vectors with the word-embedding model word2vec; because distance in the vector space reflects the semantic similarity of words, every corpus word whose similarity to a category in the original category label set exceeds 0.9 is added to the category label set, so that the expanded category labels have stronger category-representation capability;
1.4 owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore the expanded category label set is used to traverse all texts to be classified and tag each text with the corresponding category labels.
3. The method of claim 2, wherein the step 2 of calculating the weighting factors corresponding to the problem of unbalanced label number comprises the following steps:
2.1 count the number num_l of samples whose label sets contain label l, i.e. the sum over the m training samples of an indicator that takes value 1 when label l is present in example i's label set and 0 otherwise, where m is the number of training samples;
2.2 calculate the average per-class label count of the training sample label set, where γ = {y_1, y_2, …, y_q} denotes a label space containing q categories and |γ| denotes the number of categories in the label space, namely q;
2.3 to address the classification errors caused by the unbalanced number of labels, define the correction weight factor of label l from its count num_l relative to this average.
4. The method of claim 3, wherein the step 3 of calculating the proposed modified weighting factors for the spatial distribution of training examples comprises the following steps:
3.1 since the standard deviation can reflect the overall spatial distribution of the examples, the standard deviation of the distances between the examples in the local space whose label sets contain the same label is used as the local label density of that label, denoted ρ;
3.2 from the spatial distribution of the k nearest-neighbor examples of the unseen example, calculate the local label density ρ_l of the examples among those k whose label sets contain label l;
3.3 using the mutual information of the spatial distribution between the unseen example and its k nearest neighbors in the training set, obtain an ordering, from low to high, of the unseen example's influence strength on the local label density of each label in the k-nearest-neighbor label sets; the influence strength for label l is computed as follows: when label l is present in the unseen example's label set, the local label density of label l becomes a new value ρ_l′, and the influence strength of label l on the local label density is derived from the change from ρ_l to ρ_l′;
3.4 calculate the proposed correction weight factor for the spatial distribution of the training examples, where σ is a correction coefficient for the weight.
CN201710724115.9A 2017-08-22 2017-08-22 ML-kNN multi-tag Chinese text classification method based on weight Active CN107526805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710724115.9A CN107526805B (en) 2017-08-22 2017-08-22 ML-kNN multi-tag Chinese text classification method based on weight


Publications (2)

Publication Number Publication Date
CN107526805A (en) 2017-12-29
CN107526805B (en) 2019-12-24

Family

ID=60681840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710724115.9A Active CN107526805B (en) 2017-08-22 2017-08-22 ML-kNN multi-tag Chinese text classification method based on weight

Country Status (1)

Country Link
CN (1) CN107526805B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933579B (en) * 2019-02-01 2022-12-27 中山大学 Local K neighbor missing value interpolation system and method
CN110059756A (en) * 2019-04-23 2019-07-26 东华大学 A kind of multi-tag categorizing system based on multiple-objection optimization
CN111930892B (en) * 2020-08-07 2023-09-29 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
CN112464973B (en) * 2020-08-13 2024-02-02 浙江师范大学 Multi-label classification method based on average distance weight and value calculation
CN112241454B (en) * 2020-12-14 2021-02-19 成都数联铭品科技有限公司 Text classification method for processing sample inclination

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7139754B2 (en) * 2004-02-09 2006-11-21 Xerox Corporation Method for multi-class, multi-label categorization using probabilistic hierarchical modeling
CN105095494B (en) * 2015-08-21 2019-03-26 中国地质大学(武汉) The method that a kind of pair of categorized data set is tested
CN106886569B (en) * 2017-01-13 2020-05-12 重庆邮电大学 ML-KNN multi-tag Chinese text classification method based on MPI

Also Published As

Publication number Publication date
CN107526805A (en) 2017-12-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20171229

Assignee: Hangzhou Yuanchuan New Technology Co.,Ltd.

Assignor: HANGZHOU DIANZI University

Contract record no.: X2020330000104

Denomination of invention: A weight based ml KNN multi label Chinese text classification method

Granted publication date: 20191224

License type: Common License

Record date: 20201125