CN107526805A - A weight-based ML-kNN multi-label Chinese text classification method - Google Patents

A weight-based ML-kNN multi-label Chinese text classification method

Info

Publication number
CN107526805A
CN107526805A
Authority
CN
China
Prior art keywords
tag
label
text
classification
labels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710724115.9A
Other languages
Chinese (zh)
Other versions
CN107526805B (en)
Inventor
姜明
张旻
杜炼
汤景凡
程柳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201710724115.9A priority Critical patent/CN107526805B/en
Publication of CN107526805A publication Critical patent/CN107526805A/en
Application granted granted Critical
Publication of CN107526805B publication Critical patent/CN107526805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes

Abstract

The invention discloses a weight-based ML-kNN multi-label Chinese text classification method. It mainly addresses the problem that, when the ML-kNN algorithm is used to classify multi-label Chinese text, an imbalanced number of labels per class in the training set, or an uneven spatial distribution of the training samples in space, easily causes the label set of an unseen example to be misjudged or judged incompletely. The adopted technical solution assigns, within a local neighbourhood, a corrective weight to the labels of all neighbours according to the proportion of each class's label count in the training set; then, in the decision stage for the unseen example's label set, a further weight is assigned to each label according to the mutual information between the unseen example and the spatial distribution of the training examples. In addition, to improve classification efficiency, the invention pre-classifies the texts before formal classification, which effectively raises the efficiency of multi-label Chinese text classification.

Description

A weight-based ML-kNN multi-label Chinese text classification method
Technical field
The present invention relates to the field of text classification, and in particular to a weight-based ML-kNN multi-label Chinese text classification method.
Background technology
The multi-label problem is a common phenomenon in the real world. For example, in text classification, a news release may cover several predefined topics such as "education" and "sports" at the same time; in image classification, a picture may contain multiple scenes such as "field" and "mountains"; in bioinformatics, a gene may simultaneously have multiple functions such as "metabolism", "transcription" and "protein synthesis"; in audio classification, a song may belong to several categories such as "cheerful" and "happy"; in video classification, a film may belong to several genres such as "drama" and "romance". This has motivated research on multi-label classification, whose goal is to learn a multi-label classifier from the given training examples and their corresponding sets of class labels. With such a classifier, for any example to be classified, the set of labels corresponding to that example can be predicted.
Classification can be viewed as a learning problem whose task is to build a learner that assigns each example to be classified to its corresponding category. However, because an example may be associated with multiple categories at the same time, this is neither a binary nor an ordinary multi-class problem: it is a multi-label classification problem.
The traditional multi-label algorithm ML-kNN uses the "k-nearest neighbours (k-NN)" classification criterion: it collects the class label information of the neighbouring samples and infers the label set of an unseen example by "maximum a posteriori (MAP)" reasoning. When the numbers of the various class labels in the training examples are imbalanced, or the training examples of the various class labels are unevenly distributed in space, this is likely to cause the label set of an unseen example to be misjudged or judged incompletely.
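For orientation, the standard (unweighted) ML-kNN prior estimate that the MAP reasoning above starts from can be sketched as follows; this is a generic illustration of the textbook algorithm, not the patent's weighted variant, and the smoothing constant `s` and variable names are our own:

```python
import numpy as np

def mlknn_priors(Y, s=1.0):
    """Estimate the prior P(H_j) for each label j with Laplace smoothing s.
    Y is an (m, q) 0/1 matrix: Y[i, j] = 1 iff training example i carries label j."""
    m, _ = Y.shape
    p1 = (s + Y.sum(axis=0)) / (2 * s + m)  # smoothed fraction of samples carrying each label
    return p1, 1.0 - p1

# Four training examples, two labels.
Y = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
p1, p0 = mlknn_priors(Y)
print(np.round(p1, 3))
```

With label 0 present in 3 of 4 samples and label 1 in 2 of 4, the smoothed priors come out as (1+3)/(2+4) = 0.667 and (1+2)/(2+4) = 0.5.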
Summary of the invention
The purpose of the present invention is to address the above shortcomings of the prior art by providing a weight-based ML-kNN multi-label Chinese text classification method, which solves the problem that the traditional ML-kNN algorithm, when classifying multi-label Chinese text, accounts for neither the label counts nor the spatial distribution of the training examples.
A weight-based ML-kNN multi-label Chinese text classification method comprises the following steps:
Step 1: pre-classify the text to be classified;
Step 2: compute the weight factor addressing the label-count imbalance problem;
Step 3: compute the corrective weight factor proposed for the spatial distribution of the training examples;
Step 4: based on the weight factor, compute the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj denotes the event that the unseen example x has class label yj;
Step 5: compute the conditional probabilities that, given Hj holds (respectively does not hold), exactly Cj samples in N(x) have class label yj, denoted P(Cj | Hj) and P(Cj | ¬Hj) respectively, where N(x) denotes the set formed by the k nearest-neighbour samples of x in the training set, and Cj counts the samples in N(x) that have yj as one of their labels;
Step 6: combining the results computed in steps 4 and 5 and based on the ML-kNN algorithm, obtain the required multi-label classifier;
Step 7: classify the unclassified text by combining the pre-classification results from step 1 with the classifier obtained in step 6.
The pre-classification of the text to be classified in step 1 proceeds as follows:
1.1 First determine all category names, and use them as the original category label set;
1.2 Use all the text data in the training set together with the latest Chinese Wikipedia corpus as the model's corpus. The Chinese Wikipedia corpus must first undergo traditional-to-simplified conversion, then word segmentation, and then removal of stop words and low-frequency words, retaining nouns, noun phrases, adjectives, verbs and other potentially meaningful words;
1.3 Expand the category label set: use the word-embedding model word2vec to represent all the words in the corpus from step 1.2 as vectors; the distance between words in the vector space can represent their semantic similarity. With word2vec, add to the category label set every word in the corpus whose similarity to a category in the original category label set exceeds 0.9, thereby expanding the category label set so that the expanded category labels have greater power to characterize their categories;
1.4 Owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore, using the expanded category label set, traverse all texts to be classified and tag each text with the corresponding class labels.
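Steps 1.1-1.4 above can be sketched as follows. The word2vec-based expansion (similarity > 0.9) is stubbed with a hand-made synonym table, since in the patent the expansion terms would come from a word2vec model trained on the training texts plus Chinese Wikipedia; all terms and names below are hypothetical illustrations:

```python
# Hypothetical expanded category label set (step 1.3 stub: in the patent these
# sets would be grown by word2vec similarity > 0.9, not written by hand).
EXPANDED_LABELS = {
    "education": {"education", "school", "teacher"},
    "sports": {"sports", "football", "athlete"},
}

def presort(tokens):
    """Step 1.4: return the labels whose expanded terms occur in the tokenized text."""
    token_set = set(tokens)
    return {label for label, terms in EXPANDED_LABELS.items() if token_set & terms}

labels = presort(["the", "teacher", "praised", "the", "athlete"])
print(sorted(labels))  # ['education', 'sports']
```

A text that mentions no expanded term receives no pre-assigned label and is left entirely to the classifier of step 6.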
The computation in step 2 of the weight factor addressing the label-count imbalance problem proceeds as follows:
2.1 Compute the number of samples containing label l, i.e. nl = Σ(i=1..m) 1{l ∈ Yi}, where m is the number of training samples and the indicator 1{l ∈ Yi} equals 1 if example i carries label l and 0 otherwise;
2.2 Compute the mean of the per-label counts over the training-sample label sets, μ = (1/|γ|) Σl nl, where γ = {y1, y2, …, yq} denotes the label space containing q categories, so |γ|, the number of categories in the label space, equals q;
2.3 To counter the classification errors caused by label-count imbalance, define the weight factor wnum of label l in terms of nl and μ (the formula appears only as an image in the original).
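Steps 2.1-2.2 can be computed directly; the exact weight formula of step 2.3 is an image in the source, so the sketch below substitutes an illustrative assumption, w_l = μ / n_l, which matches the stated intent that under-represented labels receive larger corrective weights:

```python
import numpy as np

def imbalance_weights(Y):
    """Steps 2.1-2.2: per-label counts n_l and their mean mu over the q labels.
    Step 2.3's formula is not given in the source; w_l = mu / n_l is our
    hypothetical stand-in (rarer labels get larger weights)."""
    n = Y.sum(axis=0)              # n_l: number of training samples carrying label l
    mu = n.mean()                  # mean label count over the label space
    return mu / np.maximum(n, 1)   # guard against n_l = 0

Y = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 0, 0]])
w = imbalance_weights(Y)
print(w)  # counts [4, 1, 1], mean 2 -> weights [0.5, 2, 2]
```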
The computation in step 3 of the corrective weight factor proposed for the spatial distribution of the training examples proceeds as follows:
3.1 Since the standard deviation can reflect the overall spatial distribution of examples, the present invention takes the standard deviation of the distances between examples carrying the same label within the local space as the local label density of that label, denoted ρ;
3.2 From the spatial distribution of the k nearest-neighbour examples around the position of the unseen example, compute the local label density ρl of the examples whose label sets contain label l among those k neighbours;
3.3 Using the mutual information between the unseen example and the spatial distribution of its k nearest-neighbour examples in the training set, rank, from low to high, the strength of the unseen example's influence on the local label density of each label contained in the label sets of the k nearest-neighbour examples. Concretely, for label l: when label l is present in the unseen example's label set, the new local label density of label l is ρl′, and the influence strength of the unseen example's label l on the local label density is computed from ρl and ρl′ (the formula appears only as an image in the original);
3.4 Compute the local-label-density influence weight proposed for the spatial distribution of the training examples, where σ is the corrective coefficient of the weight (the formula appears only as an image in the original).
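Steps 3.1-3.2 can be sketched as follows. The source's formulas are images, so this is our reading of the text: the local density ρl of label l is the standard deviation of the pairwise distances among the neighbours that carry label l:

```python
import numpy as np

def local_label_density(neighbors, has_label):
    """Steps 3.1-3.2 (our reading of the text): rho_l is the standard deviation
    of pairwise distances among the k nearest neighbours that carry label l."""
    pts = neighbors[has_label]
    if len(pts) < 2:
        return 0.0  # density is taken as 0 with fewer than two carriers
    dists = [np.linalg.norm(a - b) for i, a in enumerate(pts) for b in pts[i + 1:]]
    return float(np.std(dists))

# Three 1-D neighbours carrying the label, at positions 0, 1 and 3; one without.
neighbors = np.array([[0.0], [1.0], [3.0], [10.0]])
has_label = np.array([True, True, True, False])
print(local_label_density(neighbors, has_label))  # std of distances 1, 3, 2 ≈ 0.816
```

The influence strength of step 3.3 would then compare ρl against the recomputed ρl′ after hypothetically adding the unseen example to the carriers.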
Advantages and beneficial effects of the present invention:
For the multi-label classification problem, by taking into account both the count distribution of the various class labels and the spatial distribution of the training examples, the present invention eliminates the defects of the ML-kNN algorithm in multi-label Chinese text classification and improves the effectiveness of multi-label classification. Furthermore, pre-classifying the texts before formal classification by building and expanding the category label set can greatly improve the efficiency of multi-label Chinese text classification.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of the invention.
Embodiment
The invention is further described below with reference to the accompanying drawings.
Referring to Fig. 1, a weight-based ML-kNN multi-label Chinese text classification method comprises the following steps:
1) Pre-classify the text to be classified, as follows:
1.1) First determine all category names and use them as the original category label set;
1.2) Use all the text data in the training set together with the latest Chinese Wikipedia corpus as the model's corpus. The Chinese Wikipedia corpus must first undergo traditional-to-simplified conversion, then word segmentation, and then removal of stop words (adverbs, prepositions, etc.) and low-frequency words, retaining nouns, noun phrases, adjectives, verbs and other potentially meaningful words;
1.3) Expand the category label set: use the word-embedding model word2vec to represent all words in the corpus from step 1.2) as vectors; the distance between words in the vector space can represent their semantic similarity. The present invention therefore uses word2vec to add to the category label set every word in the corpus whose similarity to a category in the original category label set exceeds 0.9, expanding the set so that the expanded category labels have greater power to characterize their categories;
1.4) Owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category. Using the expanded category label set, traverse all texts to be classified and tag each text with the corresponding class labels.
2) Compute the weight factor addressing the label-count imbalance problem, as follows:
2.1) Compute the number of samples containing label l, i.e. nl = Σ(i=1..m) 1{l ∈ Yi}, where m is the number of training samples and the indicator equals 1 if example i carries label l (that is, label l is in example i's label set Yi) and 0 otherwise;
2.2) Compute the mean of the per-label counts over the training-sample label sets, μ = (1/|γ|) Σl nl, where γ = {y1, y2, …, yq} denotes the label space containing q categories, so |γ| = q;
2.3) To counter the classification errors caused by label-count imbalance, define the weight factor wnum of label l in terms of nl and μ (the formula appears only as an image in the original). Because the weight factors of the various labels may differ considerably, applying them globally would strongly distort the classification result; the method therefore weights only the k nearest neighbours local to the unseen example.
3) Compute the corrective weight factor proposed for the spatial distribution of the training examples, as follows:
3.1) Since the standard deviation can reflect the overall spatial distribution of examples, the present invention takes the standard deviation of the distances between examples carrying the same label within the local space as the local label density of that label, denoted ρ;
3.2) From the spatial distribution of the k nearest-neighbour examples around the position of the unseen example, compute the local label density ρl of the examples whose label sets contain label l among those k neighbours;
3.3) Using the mutual information between the unseen example and the spatial distribution of its k nearest-neighbour examples in the training set, rank, from low to high, the strength of the unseen example's influence on the local label density of each label contained in the label sets of the k nearest-neighbour examples. Concretely, for label l: when label l is present in the unseen example's label set, the new local label density of label l is ρl′, and the influence strength of the unseen example's label l on the local label density is computed from ρl and ρl′ (the formula appears only as an image in the original);
3.4) Compute the local-label-density influence weight proposed for the spatial distribution of the training examples, where σ is the corrective coefficient of the weight (the formula appears only as an image in the original).
4) Taking into account the weight factor wnum computed in step 2), compute the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj denotes the event that the unseen example x has class label yj;
5) Compute the conditional probabilities that, given Hj holds (respectively does not hold), exactly Cj samples in N(x) have class label yj, denoted P(Cj | Hj) and P(Cj | ¬Hj) respectively, where N(x) denotes the set formed by the k nearest-neighbour samples of x in the training set, and Cj counts the samples in N(x) that have yj as one of their labels;
6) Combining the results computed in steps 4) and 5) and based on the ML-kNN algorithm, compute the posterior probabilities by Bayes' theorem to obtain the required multi-label classifier;
7) Before classifying a text with the classifier obtained in step 6), first directly accept the labels already determined in the pre-classification result of step 1), and then use the classifier to judge only the labels not yet determined.
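The MAP decision of steps 4)-6) can be sketched per label as follows. The priors and conditionals here are illustrative numbers, not estimates produced by the patent's weighted counting:

```python
def map_decide(prior_h, cond_h, cond_not_h, c_j):
    """Steps 4)-6): predict label y_j present iff
    P(H_j) * P(C_j | H_j) > P(~H_j) * P(C_j | ~H_j),
    where c_j is the number of the k neighbours carrying y_j."""
    post_yes = prior_h * cond_h[c_j]
    post_no = (1.0 - prior_h) * cond_not_h[c_j]
    return int(post_yes > post_no)

# With both of 2 neighbours carrying the label (c_j = 2), the posterior for
# "present" (0.6 * 0.70) beats the posterior for "absent" (0.4 * 0.05).
print(map_decide(0.6, cond_h=[0.05, 0.25, 0.70], cond_not_h=[0.70, 0.25, 0.05], c_j=2))  # 1
```

Step 7) then only applies this decision to labels the pre-classification of step 1) has not already fixed.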

Claims (4)

1. A weight-based ML-kNN multi-label Chinese text classification method, characterised by comprising the following steps:
Step 1: pre-classify the text to be classified;
Step 2: compute the weight factor addressing the label-count imbalance problem;
Step 3: compute the corrective weight factor proposed for the spatial distribution of the training examples;
Step 4: based on the weight factor, compute the prior probabilities that event Hj holds and does not hold, denoted P(Hj) and P(¬Hj) respectively, where Hj denotes the event that the unseen example x has class label yj;
Step 5: compute the conditional probabilities that, given Hj holds (respectively does not hold), exactly Cj samples in N(x) have class label yj, denoted P(Cj | Hj) and P(Cj | ¬Hj) respectively, where N(x) denotes the set formed by the k nearest-neighbour samples of x in the training set, and Cj counts the samples in N(x) that have yj as one of their labels;
Step 6: combining the results computed in steps 4 and 5 and based on the ML-kNN algorithm, obtain the required multi-label classifier;
Step 7: classify the unclassified text by combining the pre-classification results from step 1 with the classifier obtained in step 6.
2. The weight-based ML-kNN multi-label Chinese text classification method according to claim 1, characterised in that the pre-classification of the text to be classified in step 1 proceeds as follows:
1.1 First determine all category names and use them as the original category label set;
1.2 Use all the text data in the training set together with the latest Chinese Wikipedia corpus as the model's corpus, where the Chinese Wikipedia corpus must first undergo traditional-to-simplified conversion, then word segmentation, and then removal of stop words and low-frequency words, retaining nouns, noun phrases, adjectives, verbs and other potentially meaningful words;
1.3 Expand the category label set: use the word-embedding model word2vec to represent all the words in the corpus from step 1.2 as vectors; the distance between words in the vector space can represent their semantic similarity; with word2vec, add to the category label set every word in the corpus whose similarity to a category in the original category label set exceeds 0.9, thereby expanding the category label set so that the expanded category labels have greater power to characterize their categories;
1.4 Owing to the characteristics of Chinese text, when a category name appears in a text, the text is necessarily related to that category; therefore, using the expanded category label set, traverse all texts to be classified and tag each text with the corresponding class labels.
3. The weight-based ML-kNN multi-label Chinese text classification method according to claim 2, characterised in that the computation in step 2 of the weight factor addressing the label-count imbalance problem proceeds as follows:
2.1 Compute the number of samples containing label l, i.e. nl = Σ(i=1..m) 1{l ∈ Yi}, where m is the number of training samples and the indicator equals 1 if example i carries label l and 0 otherwise;
2.2 Compute the mean of the per-label counts over the training-sample label sets, μ = (1/|γ|) Σl nl, where γ = {y1, y2, …, yq} denotes the label space containing q categories, so |γ| = q;
2.3 To counter the classification errors caused by label-count imbalance, define the weight factor wnum of label l in terms of nl and μ (the formula appears only as an image in the original).
4. The weight-based ML-kNN multi-label Chinese text classification method according to claim 3, characterised in that the computation in step 3 of the corrective weight factor proposed for the spatial distribution of the training examples proceeds as follows:
3.1 Since the standard deviation can reflect the overall spatial distribution of examples, take the standard deviation of the distances between examples carrying the same label within the local space as the local label density of that label, denoted ρ;
3.2 From the spatial distribution of the k nearest-neighbour examples around the position of the unseen example, compute the local label density ρl of the examples whose label sets contain label l among those k neighbours;
3.3 Using the mutual information between the unseen example and the spatial distribution of its k nearest-neighbour examples in the training set, rank, from low to high, the strength of the unseen example's influence on the local label density of each label contained in the label sets of the k nearest-neighbour examples, where the influence strength for label l is computed as follows: when label l is present in the unseen example's label set, the new local label density of label l is ρl′, and the influence strength of the unseen example's label l on the local label density is computed from ρl and ρl′ (the formula appears only as an image in the original);
3.4 Compute the local-label-density influence weight proposed for the spatial distribution of the training examples, where σ is the corrective coefficient of the weight (the formula appears only as an image in the original).
CN201710724115.9A 2017-08-22 2017-08-22 ML-kNN multi-label Chinese text classification method based on weight Active CN107526805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710724115.9A CN107526805B (en) 2017-08-22 2017-08-22 ML-kNN multi-label Chinese text classification method based on weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710724115.9A CN107526805B (en) 2017-08-22 2017-08-22 ML-kNN multi-label Chinese text classification method based on weight

Publications (2)

Publication Number Publication Date
CN107526805A true CN107526805A (en) 2017-12-29
CN107526805B CN107526805B (en) 2019-12-24

Family

ID=60681840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710724115.9A Active CN107526805B (en) 2017-08-22 2017-08-22 ML-kNN multi-label Chinese text classification method based on weight

Country Status (1)

Country Link
CN (1) CN107526805B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933579A (en) * 2019-02-01 2019-06-25 中山大学 A kind of part k nearest neighbor missing values interpolation system and method
CN110059756A (en) * 2019-04-23 2019-07-26 东华大学 A kind of multi-tag categorizing system based on multiple-objection optimization
CN111930892A (en) * 2020-08-07 2020-11-13 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
CN112241454A (en) * 2020-12-14 2021-01-19 成都数联铭品科技有限公司 Text classification method for processing sample inclination
CN112464973A (en) * 2020-08-13 2021-03-09 浙江师范大学 Multi-label classification method based on average distance weight and value calculation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050187892A1 (en) * 2004-02-09 2005-08-25 Xerox Corporation Method for multi-class, multi-label categorization using probabilistic hierarchical modeling
CN105095494A (en) * 2015-08-21 2015-11-25 中国地质大学(武汉) Method for testing categorical data set
CN106886569A (en) * 2017-01-13 2017-06-23 重庆邮电大学 A kind of ML KNN multi-tag Chinese Text Categorizations based on MPI

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050187892A1 (en) * 2004-02-09 2005-08-25 Xerox Corporation Method for multi-class, multi-label categorization using probabilistic hierarchical modeling
CN105095494A (en) * 2015-08-21 2015-11-25 中国地质大学(武汉) Method for testing categorical data set
CN106886569A (en) * 2017-01-13 2017-06-23 重庆邮电大学 A kind of ML KNN multi-tag Chinese Text Categorizations based on MPI

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Sinan et al., "Multi-label Data Mining Techniques: A Survey", Computer Science (《计算机科学》) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933579A (en) * 2019-02-01 2019-06-25 中山大学 A kind of part k nearest neighbor missing values interpolation system and method
CN110059756A (en) * 2019-04-23 2019-07-26 东华大学 A kind of multi-tag categorizing system based on multiple-objection optimization
CN111930892A (en) * 2020-08-07 2020-11-13 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
CN111930892B (en) * 2020-08-07 2023-09-29 重庆邮电大学 Scientific and technological text classification method based on improved mutual information function
CN112464973A (en) * 2020-08-13 2021-03-09 浙江师范大学 Multi-label classification method based on average distance weight and value calculation
CN112464973B (en) * 2020-08-13 2024-02-02 浙江师范大学 Multi-label classification method based on average distance weight and value calculation
CN112241454A (en) * 2020-12-14 2021-01-19 成都数联铭品科技有限公司 Text classification method for processing sample inclination

Also Published As

Publication number Publication date
CN107526805B (en) 2019-12-24

Similar Documents

Publication Publication Date Title
CN107526805A (en) A kind of ML kNN multi-tag Chinese Text Categorizations based on weight
CN107169049B (en) Application tag information generation method and device
CN107609121A (en) Newsletter archive sorting technique based on LDA and word2vec algorithms
Barbieri et al. Multimodal emoji prediction
US20190205393A1 (en) A cross-media search method
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN103810274B (en) Multi-characteristic image tag sorting method based on WordNet semantic similarities
Bin et al. Adaptively attending to visual attributes and linguistic knowledge for captioning
CN108090216B (en) Label prediction method, device and storage medium
CN110674312B (en) Method, device and medium for constructing knowledge graph and electronic equipment
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN113553429B (en) Normalized label system construction and text automatic labeling method
CN112819023A (en) Sample set acquisition method and device, computer equipment and storage medium
CN105138991A (en) Video emotion identification method based on emotion significant feature integration
CN104537028B (en) A kind of Web information processing method and device
CN106997341A (en) A kind of innovation scheme matching process, device, server and system
CN110569920A (en) prediction method for multi-task machine learning
CN109446393B (en) Network community topic classification method and device
CN108090099B (en) Text processing method and device
CN111858896A (en) Knowledge base question-answering method based on deep learning
CN106227836B (en) Unsupervised joint visual concept learning system and unsupervised joint visual concept learning method based on images and characters
CN112182249A (en) Automatic classification method and device for aviation safety report
Jolly et al. How do convolutional neural networks learn design?
CN111339338B (en) Text picture matching recommendation method based on deep learning
CN109271513A (en) A kind of file classification method, computer-readable storage media and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20171229

Assignee: Hangzhou Yuanchuan New Technology Co.,Ltd.

Assignor: HANGZHOU DIANZI University

Contract record no.: X2020330000104

Denomination of invention: A weight based ml KNN multi label Chinese text classification method

Granted publication date: 20191224

License type: Common License

Record date: 20201125