CN107526805A - Weight-based ML-kNN multi-label Chinese text classification method - Google Patents
Weight-based ML-kNN multi-label Chinese text classification method
- Publication number
- CN107526805A CN201710724115.9A CN201710724115A
- Authority
- CN
- China
- Prior art keywords
- tag
- label
- text
- classification
- labels
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
Abstract
The invention discloses a weight-based ML-kNN multi-label Chinese text classification method. The invention mainly addresses a problem that arises when the ML-kNN algorithm is used to classify multi-label Chinese text: an imbalanced number of labels per class in the training set, or an uneven spatial distribution of training samples, easily leads to wrong or incomplete judgements of an unseen instance's label set. The technical solution adopted by the invention is to assign, within a local range, corresponding correction weights to the labels of all neighbours according to the proportion of each class's label count in the training set; then, in the decision stage for the unseen instance's label set, to assign a different weight to each label according to the mutual information between the unseen instance and the spatial distribution of the training instances. In addition, to improve classification efficiency, the invention performs a pre-sorting pass over the text before formal classification, which can effectively improve the efficiency of multi-label Chinese text classification.
Description
Technical field
The present invention relates to the field of text classification, and in particular to a weight-based ML-kNN multi-label Chinese text classification method.
Background art
Multi-label problems are a common phenomenon in the real world. For example, in text categorization, a news release may simultaneously cover several predetermined themes such as "education" and "sports"; in image classification, a picture may contain several scenes at once, such as "field" and "mountain"; in bioinformatics, a gene may simultaneously have multiple functions such as "metabolism", "transcription" and "protein synthesis"; in audio categorization, a song may belong to several classes such as "cheerful" and "happy"; in video classification, a film may simultaneously belong to several classes such as "drama" and "romance". This has motivated the study of multi-label classification, whose goal is to learn a multi-label classifier from the given training instances and their corresponding sets of class labels. On this basis, for any instance to be classified, the classifier can predict the label set corresponding to that instance.
Classification can be treated as a learning problem whose task is to construct a learner that assigns a given instance to be classified to its corresponding class. However, because an instance to be classified may be associated with multiple classes at the same time, this is not a binary or multi-class problem but a multi-label classification problem.
The traditional multi-label classification algorithm ML-kNN adopts the "k-nearest neighbours" classification criterion: it counts the class label information of the neighbouring samples and infers the label set of an unseen instance by "maximum a posteriori (MAP)" reasoning. When the numbers of the various class labels contained in the training instances are imbalanced, or the training instances containing the various class labels are unevenly distributed in space, this is likely to cause wrong or incomplete judgements of the unseen instance's label set.
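As context for the ML-kNN procedure described above (count the neighbours' labels, then decide each label by maximum a posteriori reasoning), the following is a minimal sketch of the classic, unweighted ML-kNN decision with Laplace smoothing; it does not include the invention's correction weights, and the variable names are illustrative:

```python
import numpy as np

def ml_knn_posteriors(train_X, train_Y, x, k=3, s=1.0):
    """Minimal ML-kNN decision for one unseen instance x.

    train_X: (m, d) feature matrix; train_Y: (m, q) 0/1 label matrix;
    s: Laplace smoothing constant. Returns a 0/1 label vector chosen
    by maximum a posteriori (MAP) reasoning over each label.
    """
    m, q = train_Y.shape

    # Prior P(H_j): smoothed fraction of training samples carrying label j.
    prior = (s + train_Y.sum(axis=0)) / (s * 2 + m)

    def knn(exclude, point):
        # Indices of the k nearest training samples to `point`,
        # optionally excluding one index (the sample itself).
        d = np.linalg.norm(train_X - point, axis=1)
        if exclude is not None:
            d[exclude] = np.inf
        return np.argsort(d)[:k]

    # c1[j][r] / c0[j][r]: number of training samples with / without label j
    # whose k nearest neighbours contain exactly r samples carrying label j.
    c1 = np.zeros((q, k + 1))
    c0 = np.zeros((q, k + 1))
    for i in range(m):
        counts = train_Y[knn(i, train_X[i])].sum(axis=0).astype(int)
        for j in range(q):
            (c1 if train_Y[i, j] else c0)[j, counts[j]] += 1

    # For the unseen x: C_j = number of its k neighbours carrying label j.
    C = train_Y[knn(None, x)].sum(axis=0).astype(int)
    y_hat = np.zeros(q, dtype=int)
    for j in range(q):
        # Smoothed conditionals P(E_Cj | H_j) and P(E_Cj | not H_j).
        p_e_h1 = (s + c1[j, C[j]]) / (s * (k + 1) + c1[j].sum())
        p_e_h0 = (s + c0[j, C[j]]) / (s * (k + 1) + c0[j].sum())
        # MAP: keep label j iff the "has label" posterior dominates.
        y_hat[j] = int(prior[j] * p_e_h1 > (1 - prior[j]) * p_e_h0)
    return y_hat
```

The invention's contribution, described below, is to modulate these priors and decisions with the label-count and spatial-distribution weights of steps 2 and 3.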
Summary of the invention
The purpose of the present invention is, in view of the shortcomings of the prior art, to provide a weight-based ML-kNN multi-label Chinese text classification method, so as to solve the problem that the traditional ML-kNN algorithm, when classifying multi-label Chinese text, takes into account neither the label counts nor the spatial distribution of the training instances.
A weight-based ML-kNN multi-label Chinese text classification method comprises the following steps:
Step 1: pre-sort the text to be classified;
Step 2: calculate the weight factor addressing the label-count imbalance problem;
Step 3: calculate the correction weight factor proposed for the spatial distribution of the training instances;
Step 4: based on the weight factor, calculate the prior probabilities that the event H_j holds and does not hold, denoted P(H_j) and P(¬H_j) respectively, where H_j denotes the event that the unseen instance x has class label y_j;
Step 5: calculate the conditional probabilities that, when H_j holds and when it does not, exactly C_j samples in N(x) have class label y_j, denoted P(E_{C_j}|H_j) and P(E_{C_j}|¬H_j) respectively, where N(x) is the set formed by the k nearest-neighbour samples of x in the training set and C_j counts the samples in N(x) that carry y_j as a relevant label;
Step 6: combining the results calculated in steps 4 and 5, obtain the required multi-label classifier based on the ML-kNN algorithm;
Step 7: classify the unclassified text by combining the pre-sorting result from step 1 with the classifier obtained in step 6.
The pre-sorting of the text to be classified described in step 1 proceeds as follows:
1.1 First determine all category names and use them as the original category label set;
1.2 Use all the text data in the training set together with the latest Chinese Wikipedia corpus as the model's corpus; the Chinese Wikipedia corpus must first undergo traditional-to-simplified conversion, then word segmentation, and then removal of stop words and low-frequency words, retaining nouns, noun phrases, adjectives, verbs and other potentially meaningful words;
1.3 Expand the category label set: using the word-vector representation model word2vec, represent all words in the corpus of step 1.2 as vectors, the distance between words in the vector space representing their semantic similarity. With word2vec, add to the category label set those corpus words whose similarity to a category in the original category label set exceeds 0.9, so as to expand the category label set and give the expanded category labels stronger category-representation ability;
1.4 Owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore, search with the expanded category label set, traverse all texts to be classified, and mark each text with its corresponding class labels.
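A minimal sketch of the pre-sorting pass of steps 1.1 to 1.4 follows. The `vectors` dictionary is a toy stand-in for the trained word2vec model (the method above trains on the training-set texts plus Chinese Wikipedia); the only mechanics are cosine similarity against the 0.9 threshold for step 1.3 and set intersection for step 1.4:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two word vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_label_terms(vectors, category_names, threshold=0.9):
    """Step 1.3: add to each category's label set every corpus word whose
    embedding similarity to the category name exceeds `threshold`.
    `vectors` is a {word: vector} dict standing in for a trained
    word2vec model; words and vectors used in tests are assumptions."""
    expanded = {c: {c} for c in category_names}
    for word, vec in vectors.items():
        for c in category_names:
            if word != c and cosine(vec, vectors[c]) > threshold:
                expanded[c].add(word)
    return expanded

def presort(text_tokens, expanded):
    """Step 1.4: a text containing any term of a category's expanded
    label set is pre-assigned that category label."""
    return {c for c, terms in expanded.items() if terms & set(text_tokens)}
```

With toy vectors where "football" lies close to "sports", `presort(["the", "football", "match"], expanded)` would pre-assign the "sports" label before the classifier runs.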
The calculation described in step 2 of the weight factor addressing the label-count imbalance problem proceeds as follows:
2.1 Calculate the number of samples containing label l, i.e. n_l = Σ_{i=1}^{m} y_i^l, where m is the number of training samples and y_i^l is 1 if instance i carries label l and 0 otherwise;
2.2 Calculate the mean n̄ = (1/|γ|) Σ_{l=1}^{q} n_l of the per-class label counts over the training label sets, where γ = {y_1, y_2, …, y_q} is the label space containing q classes and |γ| = q is the number of classes in the label space;
2.3 To address the classification errors caused by label-count imbalance, define the weight factor w_num of label l.
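The counting in steps 2.1 and 2.2 can be sketched as below. The exact weight-factor formula of step 2.3 is not reproduced in this text, so the inverse-frequency form n̄ / n_l used here is an assumption (rare labels get weights above 1, frequent labels below 1):

```python
def imbalance_weights(Y):
    """Steps 2.1-2.3 for the label-count imbalance weight.

    Y: list of 0/1 label vectors, one per training sample.
    The step-2.3 formula n_bar / n_l is an assumed instantiation.
    """
    q = len(Y[0])                                 # number of labels
    n = [sum(y[l] for y in Y) for l in range(q)]  # 2.1: samples with label l
    n_bar = sum(n) / q                            # 2.2: mean per-label count
    return [n_bar / n_l if n_l else 0.0 for n_l in n]  # 2.3 (assumed form)
```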
The calculation described in step 3 of the correction weight factor proposed for the spatial distribution of the training instances proceeds as follows:
3.1 Because the standard deviation can reflect the overall spatial distribution of instances, the present invention uses, within a local space, the standard deviation of the distances between instances carrying the same label as that label's local label density, denoted ρ;
3.2 According to the spatial distribution of the k nearest-neighbour instances around the position of the unseen instance, calculate the local label density ρ_l of the instances in the k-neighbour label set that contain label l;
3.3 Using the mutual information between the unseen instance and the spatial distribution of its k nearest-neighbour instances in the training set, obtain, ordered from low to high, the influence intensity of the unseen instance on the local label density of each label contained in the label set of the k nearest-neighbour instances. The influence intensity for a label l is calculated as follows: when label l is present in the unseen instance's label set, the new local label density of l is ρ_l', and the influence intensity of the unseen instance's label l on the local label density is computed from ρ_l and ρ_l';
3.4 Calculate the local-label-density influence weight proposed for the spatial distribution of the training instances, where σ is the correction factor of the weight.
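Steps 3.1 to 3.3 can be sketched as follows. The standard deviation of pairwise distances serves as the local label density ρ per step 3.1; because the exact influence-intensity expression of step 3.3 is not given in this text, the relative change in density when the unseen instance is added to the label's neighbour group is an assumed stand-in:

```python
import numpy as np

def local_label_density(points):
    """Step 3.1: the standard deviation of pairwise distances among the
    neighbour instances carrying one label, used as that label's local
    label density rho."""
    d = [np.linalg.norm(points[i] - points[j])
         for i in range(len(points)) for j in range(i + 1, len(points))]
    return float(np.std(d)) if d else 0.0

def influence_intensity(neigh_pts, x):
    """Step 3.3 sketch: how strongly the unseen instance x perturbs the
    local density when hypothetically added to the label's neighbour
    group. The relative density change is an assumed stand-in for the
    patent's expression over rho_l and rho_l'."""
    rho = local_label_density(neigh_pts)          # rho_l
    rho_new = local_label_density(neigh_pts + [x])  # rho_l'
    return abs(rho_new - rho) / rho if rho else 0.0
```

An unseen instance far from a label's tight neighbour cluster perturbs that label's density strongly (high intensity), while one inside the cluster barely changes it, which is the ordering step 3.3 exploits.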
Advantages and beneficial effects of the present invention:
Aiming at the multi-label classification problem, the present invention takes into account both the count distribution of the various class labels and the spatial distribution of the training instances, eliminating the defects of the ML-kNN algorithm in multi-label Chinese text classification and improving the effect of multi-label classification. Moreover, by building and expanding the category label set to pre-sort the texts before formal classification, the efficiency of multi-label Chinese text classification can be greatly improved.
Brief description of the drawings
Fig. 1 is the flow block diagram of the method of the invention.
Embodiments
The invention will be further described below with reference to the accompanying drawing.
With reference to Fig. 1, a weight-based ML-kNN multi-label Chinese text classification method comprises the following steps:
1) Pre-sort the text to be classified; the processing is as follows:
1.1) First determine all category names and use them as the original category label set;
1.2) Use all the text data in the training set together with the latest Chinese Wikipedia corpus as the model's corpus; the Chinese Wikipedia corpus must first undergo traditional-to-simplified conversion, then word segmentation, and then removal of stop words (adverbs, prepositions, etc.) and low-frequency words, retaining nouns, noun phrases, adjectives, verbs and other potentially meaningful words;
1.3) Expand the category label set: using the word-vector representation model word2vec, represent all words in the corpus of step 1.2) as vectors, the distance between words in the vector space representing their semantic similarity. The present invention therefore uses word2vec to add to the category label set those corpus words whose similarity to a category in the original category label set exceeds 0.9, so as to expand the category label set and give the expanded category labels stronger category-representation ability;
1.4) Owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category. Therefore, search with the expanded category label set, traverse all texts to be classified, and mark each text with its corresponding class labels.
2) Calculate the weight factor addressing the label-count imbalance problem, as follows:
2.1) Calculate the number of samples containing label l, i.e. n_l = Σ_{i=1}^{m} y_i^l, where m is the number of training samples and y_i^l is 1 if instance i carries label l (i.e. label l is in the label set of instance i) and 0 otherwise;
2.2) Calculate the mean n̄ = (1/|γ|) Σ_{l=1}^{q} n_l of the per-class label counts over the training label sets, where γ = {y_1, y_2, …, y_q} is the label space containing q classes and |γ| = q is the number of classes in the label space;
2.3) To address the classification errors caused by label-count imbalance, define the weight factor w_num of label l. Because the weight factors of the various labels may differ greatly, applying them globally would strongly affect the classification result, so the method instead weights only the k nearest-neighbour instances local to the unseen instance.
3) Calculate the correction weight factor proposed for the spatial distribution of the training instances, as follows:
3.1) Because the standard deviation can reflect the overall spatial distribution of instances, the present invention uses, within a local space, the standard deviation of the distances between instances carrying the same label as that label's local label density, denoted ρ;
3.2) According to the spatial distribution of the k nearest-neighbour instances around the position of the unseen instance, calculate the local label density ρ_l of the instances in the k-neighbour label set that contain label l;
3.3) Using the mutual information between the unseen instance and the spatial distribution of its k nearest-neighbour instances in the training set, obtain, ordered from low to high, the influence intensity of the unseen instance on the local label density of each label contained in the label set of the k nearest-neighbour instances. The influence intensity for a label l is calculated as follows: when label l is present in the unseen instance's label set, the new local label density of l is ρ_l', and the influence intensity of the unseen instance's label l on the local label density is computed from ρ_l and ρ_l';
3.4) Calculate the local-label-density influence weight proposed for the spatial distribution of the training instances, where σ is the correction factor of the weight.
4) Taking into account the weight factor w_num calculated in step 2), calculate the prior probabilities that the event H_j holds and does not hold, denoted P(H_j) and P(¬H_j) respectively, where H_j denotes the event that the unseen instance x has class label y_j;
5) Calculate the conditional probabilities that, when H_j holds and when it does not, exactly C_j samples in N(x) have class label y_j, denoted P(E_{C_j}|H_j) and P(E_{C_j}|¬H_j) respectively, where N(x) is the set formed by the k nearest-neighbour samples of x in the training set and C_j counts the samples in N(x) that carry y_j as a relevant label;
6) Combining the results calculated in steps 4) and 5), compute the posterior probability of each label by Bayes' theorem, based on the ML-kNN algorithm, to obtain the required multi-label classifier;
7) When classifying the text to be classified with the classifier obtained in step 6), first directly adopt the labels already obtained in the pre-sorting result of step 1), skipping their re-evaluation, and then judge the remaining undetermined labels.
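The decision flow of step 7) — accept the pre-sorted labels directly and let the trained classifier judge only the remaining labels — can be sketched as follows, with `classifier_scores` standing in for the per-label posterior probabilities of step 6) and the 0.5 threshold an illustrative assumption:

```python
def classify_with_presort(presorted_labels, classifier_scores, threshold=0.5):
    """Step 7 sketch: labels found by the pre-sorting pass (step 1) are
    accepted without re-evaluation; only labels not yet decided are
    judged from the classifier's posterior scores."""
    decided = set(presorted_labels)
    for label, score in classifier_scores.items():
        if label not in decided and score > threshold:
            decided.add(label)  # classifier decides the undetermined labels
    return decided
```

Skipping the pre-sorted labels is what yields the efficiency gain claimed above: the classifier evaluates fewer label decisions per text.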
Claims (4)
1. A weight-based ML-kNN multi-label Chinese text classification method, characterized by comprising the following steps:
Step 1: pre-sort the text to be classified;
Step 2: calculate the weight factor addressing the label-count imbalance problem;
Step 3: calculate the correction weight factor proposed for the spatial distribution of the training instances;
Step 4: based on the weight factor, calculate the prior probabilities that the event H_j holds and does not hold, denoted P(H_j) and P(¬H_j) respectively, where H_j denotes the event that the unseen instance x has class label y_j;
Step 5: calculate the conditional probabilities that, when H_j holds and when it does not, exactly C_j samples in N(x) have class label y_j, denoted P(E_{C_j}|H_j) and P(E_{C_j}|¬H_j) respectively, where N(x) is the set formed by the k nearest-neighbour samples of x in the training set and C_j counts the samples in N(x) that carry y_j as a relevant label;
Step 6: combining the results calculated in steps 4 and 5, obtain the required multi-label classifier based on the ML-kNN algorithm;
Step 7: classify the unclassified text by combining the pre-sorting result from step 1 with the classifier obtained in step 6.
2. The weight-based ML-kNN multi-label Chinese text classification method according to claim 1, characterized in that the pre-sorting of the text to be classified described in step 1 proceeds as follows:
1.1 First determine all category names and use them as the original category label set;
1.2 Use all the text data in the training set together with the latest Chinese Wikipedia corpus as the model's corpus, where the Chinese Wikipedia corpus must first undergo traditional-to-simplified conversion, then word segmentation, and then removal of stop words and low-frequency words, retaining nouns, noun phrases, adjectives, verbs and other potentially meaningful words;
1.3 Expand the category label set: using the word-vector representation model word2vec, represent all words in the corpus of step 1.2 as vectors, the distance between words in the vector space representing their semantic similarity; with word2vec, add to the category label set those corpus words whose similarity to a category in the original category label set exceeds 0.9, so as to expand the category label set and give the expanded category labels stronger category-representation ability;
1.4 Owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore, search with the expanded category label set, traverse all texts to be classified, and mark each text with its corresponding class labels.
3. The weight-based ML-kNN multi-label Chinese text classification method according to claim 2, characterized in that the calculation in step 2 of the weight factor addressing the label-count imbalance problem proceeds as follows:
2.1 Calculate the number of samples containing label l, i.e. n_l = Σ_{i=1}^{m} y_i^l, where m is the number of training samples and y_i^l is 1 if instance i carries label l and 0 otherwise;
2.2 Calculate the mean n̄ = (1/|γ|) Σ_{l=1}^{q} n_l of the per-class label counts over the training label sets, where γ = {y_1, y_2, …, y_q} is the label space containing q classes and |γ| = q is the number of classes in the label space;
2.3 To address the classification errors caused by label-count imbalance, define the weight factor w_num of label l.
4. The weight-based ML-kNN multi-label Chinese text classification method according to claim 3, characterized in that the calculation in step 3 of the correction weight factor proposed for the spatial distribution of the training instances proceeds as follows:
3.1 Because the standard deviation can reflect the overall spatial distribution of instances, the invention uses, within a local space, the standard deviation of the distances between instances carrying the same label as that label's local label density, denoted ρ;
3.2 According to the spatial distribution of the k nearest-neighbour instances around the position of the unseen instance, calculate the local label density ρ_l of the instances in the k-neighbour label set that contain label l;
3.3 Using the mutual information between the unseen instance and the spatial distribution of its k nearest-neighbour instances in the training set, obtain, ordered from low to high, the influence intensity of the unseen instance on the local label density of each label contained in the label set of the k nearest-neighbour instances; the influence intensity for a label l is calculated as follows: when label l is present in the unseen instance's label set, the new local label density of l is ρ_l', and the influence intensity of the unseen instance's label l on the local label density is computed from ρ_l and ρ_l';
3.4 Calculate the local-label-density influence weight proposed for the spatial distribution of the training instances, where σ is the correction factor of the weight.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710724115.9A CN107526805B (en) | 2017-08-22 | 2017-08-22 | ML-kNN multi-label Chinese text classification method based on weight |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710724115.9A CN107526805B (en) | 2017-08-22 | 2017-08-22 | ML-kNN multi-label Chinese text classification method based on weight |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107526805A true CN107526805A (en) | 2017-12-29 |
CN107526805B CN107526805B (en) | 2019-12-24 |
Family
ID=60681840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710724115.9A Active CN107526805B (en) | ML-kNN multi-label Chinese text classification method based on weight | 2017-08-22 | 2017-08-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107526805B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933579A (en) * | 2019-02-01 | 2019-06-25 | 中山大学 | Local k-nearest-neighbour missing-value imputation system and method
CN110059756A (en) * | 2019-04-23 | 2019-07-26 | 东华大学 | Multi-label classification system based on multi-objective optimization
CN111930892A (en) * | 2020-08-07 | 2020-11-13 | 重庆邮电大学 | Scientific and technological text classification method based on improved mutual information function |
CN112241454A (en) * | 2020-12-14 | 2021-01-19 | 成都数联铭品科技有限公司 | Text classification method for processing sample inclination |
CN112464973A (en) * | 2020-08-13 | 2021-03-09 | 浙江师范大学 | Multi-label classification method based on average distance weight and value calculation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050187892A1 (en) * | 2004-02-09 | 2005-08-25 | Xerox Corporation | Method for multi-class, multi-label categorization using probabilistic hierarchical modeling |
CN105095494A (en) * | 2015-08-21 | 2015-11-25 | 中国地质大学(武汉) | Method for testing categorical data set |
CN106886569A (en) * | 2017-01-13 | 2017-06-23 | 重庆邮电大学 | MPI-based ML-KNN multi-label Chinese text classification method
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050187892A1 (en) * | 2004-02-09 | 2005-08-25 | Xerox Corporation | Method for multi-class, multi-label categorization using probabilistic hierarchical modeling |
CN105095494A (en) * | 2015-08-21 | 2015-11-25 | 中国地质大学(武汉) | Method for testing categorical data set |
CN106886569A (en) * | 2017-01-13 | 2017-06-23 | 重庆邮电大学 | MPI-based ML-KNN multi-label Chinese text classification method
Non-Patent Citations (1)
Title |
---|
LI Si-nan et al., "Multi-label data mining techniques: a survey", Computer Science (《计算机科学》) *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933579A (en) * | 2019-02-01 | 2019-06-25 | 中山大学 | Local k-nearest-neighbour missing-value imputation system and method
CN110059756A (en) * | 2019-04-23 | 2019-07-26 | 东华大学 | Multi-label classification system based on multi-objective optimization
CN111930892A (en) * | 2020-08-07 | 2020-11-13 | 重庆邮电大学 | Scientific and technological text classification method based on improved mutual information function |
CN111930892B (en) * | 2020-08-07 | 2023-09-29 | 重庆邮电大学 | Scientific and technological text classification method based on improved mutual information function |
CN112464973A (en) * | 2020-08-13 | 2021-03-09 | 浙江师范大学 | Multi-label classification method based on average distance weight and value calculation |
CN112464973B (en) * | 2020-08-13 | 2024-02-02 | 浙江师范大学 | Multi-label classification method based on average distance weight and value calculation |
CN112241454A (en) * | 2020-12-14 | 2021-01-19 | 成都数联铭品科技有限公司 | Text classification method for processing sample inclination |
Also Published As
Publication number | Publication date |
---|---|
CN107526805B (en) | 2019-12-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107526805A (en) | Weight-based ML-kNN multi-label Chinese text classification method | |
CN107169049B (en) | Application tag information generation method and device | |
CN107609121A (en) | News text classification method based on LDA and word2vec algorithms | |
Barbieri et al. | Multimodal emoji prediction | |
US20190205393A1 (en) | A cross-media search method | |
CN110532379B (en) | Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis | |
CN103810274B (en) | Multi-characteristic image tag sorting method based on WordNet semantic similarities | |
Bin et al. | Adaptively attending to visual attributes and linguistic knowledge for captioning | |
CN108090216B (en) | Label prediction method, device and storage medium | |
CN110674312B (en) | Method, device and medium for constructing knowledge graph and electronic equipment | |
CN109492105B (en) | Text emotion classification method based on multi-feature ensemble learning | |
CN113553429B (en) | Normalized label system construction and text automatic labeling method | |
CN112819023A (en) | Sample set acquisition method and device, computer equipment and storage medium | |
CN105138991A (en) | Video emotion identification method based on emotion significant feature integration | |
CN104537028B (en) | A kind of Web information processing method and device | |
CN106997341A (en) | Innovation scheme matching method, device, server and system | |
CN110569920A (en) | prediction method for multi-task machine learning | |
CN109446393B (en) | Network community topic classification method and device | |
CN108090099B (en) | Text processing method and device | |
CN111858896A (en) | Knowledge base question-answering method based on deep learning | |
CN106227836B (en) | Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters | |
CN112182249A (en) | Automatic classification method and device for aviation safety report | |
Jolly et al. | How do convolutional neural networks learn design? | |
CN111339338B (en) | Text picture matching recommendation method based on deep learning | |
CN109271513A (en) | Text classification method, computer-readable storage medium and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | ||
EE01 | Entry into force of recordation of patent licensing contract |
Application publication date: 20171229 Assignee: Hangzhou Yuanchuan New Technology Co.,Ltd. Assignor: HANGZHOU DIANZI University Contract record no.: X2020330000104 Denomination of invention: A weight-based ML-kNN multi-label Chinese text classification method Granted publication date: 20191224 License type: Common License Record date: 20201125 |