CN114282525A - Text classification method, system and computer equipment based on improved TF-IDF - Google Patents

Text classification method, system and computer equipment based on improved TF-IDF

Info

Publication number
CN114282525A
CN114282525A
Authority
CN
China
Prior art keywords
text
feature
class
category
idf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111584594.1A
Other languages
Chinese (zh)
Inventor
金平艳
石珺
李志鹏
廖勇
杨阳朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wanglian Anrui Network Technology Co ltd
Original Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wanglian Anrui Network Technology Co ltd filed Critical Shenzhen Wanglian Anrui Network Technology Co ltd
Priority to CN202111584594.1A priority Critical patent/CN114282525A/en
Publication of CN114282525A publication Critical patent/CN114282525A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method, system and computer equipment based on improved TF-IDF, and relates to the technical field of semantic networks. A text training sample set and a category set are constructed, and category abstract features are extracted from the text training samples. A test sample is first segmented into words and stop words are removed; it is then converted into a vector space model according to a feature weighting function, and the top m features are likewise extracted to represent the test text. The similarity between the test text and each category is calculated, and the category with the highest similarity is taken as the category of the test text. Finally, the text training sample library is updated. The invention comprehensively considers the inter-class discrimination and intra-class contribution of features, so the obtained feature values can represent both the category features and the current document features. The classification results obtained by the method agree more closely with empirical values and have higher accuracy. The algorithm of the invention is fast in operation and processing, simple and convenient, and yields efficient classification results.

Description

Text classification method, system and computer equipment based on improved TF-IDF
Technical Field
The invention belongs to the technical field of semantic networks, and particularly relates to an improved TF-IDF text classification method, an improved TF-IDF text classification system and computer equipment.
Background
At present, with the development and popularization of the internet, the demand of users for various digital information is increasing. Meanwhile, ways to acquire digital information are increasing.
However, the quality of the information obtained is not uniform, which causes many difficulties for the user to process the information. By means of the automatic processing technology, data can be conveniently organized and managed, a large amount of processing time can be saved, and more importantly, errors generated in the manual processing process can be avoided. Automatic text classification techniques have been widely used in many areas of social life as an effective means of handling large amounts of text information, and have achieved satisfactory results. The TF-IDF method is a commonly used feature item weight calculation method in the text vectorization process, and measures the importance of feature items in the whole document set.
Aiming at the problem that the TF-IDF method cannot reflect the distinguishing capability of the feature items on the categories and the representativeness of the feature items on the categories in the text classification process, the invention provides an improved TF-IDF text classification method based on the document categories and the inter-category distinguishing degree and the intra-category contribution degree of the feature items.
Through the above analysis, the problems and defects of the prior art are as follows: (1) the TF-IDF method used in prior-art text classification cannot reflect the ability of feature items to distinguish between categories or their representativeness of a category, so the resulting feature values cannot represent category features and current document features; (2) the classification results obtained by the prior art do not agree with empirical values, and the accuracy is low; (3) the prior art is inefficient in obtaining classification results; (4) the prior art cannot help improve the efficiency and accuracy of subsequent information retrieval.
Disclosure of Invention
In order to overcome the problems in the related art, the disclosed embodiments of the present invention provide a method, system and computer device for text classification based on improved TF-IDF.
The technical scheme is as follows: a text classification method based on improved TF-IDF includes extracting category abstract features from the constructed text training samples; the extraction comprises the following steps:
traversing each category in the text training sample set and counting the number of documents N_i in each category;
traversing each feature in the text training sample set and counting, for each category, the number of documents n_it that contain the current feature;
traversing each feature in the text training sample set and calculating, for each feature, the inter-class discrimination D_it of each category for the current feature;
traversing each feature in the text training sample set and calculating, for each feature, the intra-class contribution S_it of each category to the current feature;
calculating the feature value f_it under each category with an objective function, and extracting the top m representative category abstract features.
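As a minimal illustration, the two counting steps above (N_i and n_it) can be sketched in Python; the toy corpus, category names and tokens below are hypothetical and not from the patent:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each training document is (category, token list).
corpus = [
    ("sports", ["match", "team", "score", "team"]),
    ("sports", ["coach", "team", "win"]),
    ("finance", ["stock", "market", "score"]),
]

# N_i: number of documents in each category i
N = Counter(cat for cat, _ in corpus)

# n_it: number of documents in category i that contain feature t
n = defaultdict(Counter)
for cat, tokens in corpus:
    for t in set(tokens):  # set(): count each document at most once per feature
        n[cat][t] += 1

print(N["sports"])           # 2
print(n["sports"]["team"])   # 2
print(n["finance"]["score"]) # 1
```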
In one embodiment, the method comprises traversing each feature in the text training sample set and calculating, for each feature, the inter-class discrimination D_it of each category for the current feature; the specific calculation process comprises the following steps:
if a feature item appears with high frequency in a certain class and with low frequency in the other classes, it is given a high weight; the inter-class discrimination is calculated as:

P_it = (1/n_it) * Σ_{d=1}^{n_it} (n_idt / maxtf_id)

where P_it is the average occurrence probability value of feature t in category i, n_idt is the word frequency of feature t in the d-th of the n_it documents of category i that contain t, and maxtf_id is the maximum feature word frequency in that document;

[Formula image not reproduced in the source: Q_t, the word frequency with which feature t appears among the different classes, is computed from n_it, N_i and the number of sample class types L.]

The inter-class discrimination D_it is then:
D_it = P_it * Q_t
where P_it is the average occurrence probability value of feature t in category i and Q_t is the word frequency with which feature t appears among the different classes.
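A minimal Python sketch of the inter-class discrimination D_it = P_it * Q_t follows. P_it implements the verbal description above (average normalized term frequency over the documents of the class that contain t); the formula for Q_t is an unreproduced image in the source, so an IDF-style cross-class factor stands in for it here as a labeled assumption, and all data are illustrative:

```python
import math
from collections import Counter

# Toy documents grouped by class; each Counter maps feature -> word frequency.
docs_by_class = {
    "sports":  [Counter(team=3, win=1), Counter(team=2, coach=2)],
    "finance": [Counter(stock=4, market=1)],
}
L = len(docs_by_class)  # number of sample class types

def P(cls, t):
    """P_it: average normalized frequency of feature t over the n_it
    documents of class cls that contain t (one reading of the patent's
    verbal description)."""
    docs = [d for d in docs_by_class[cls] if t in d]
    if not docs:
        return 0.0
    return sum(d[t] / max(d.values()) for d in docs) / len(docs)

def Q(t):
    """Q_t: cross-class factor. ASSUMPTION: the patent's formula is an
    unreproduced image, so an IDF-style penalty over the classes that
    contain t is used in its place."""
    classes_with_t = sum(1 for docs in docs_by_class.values()
                         if any(t in d for d in docs))
    return math.log(L / classes_with_t) + 1.0  # +1 keeps the factor positive

def D(cls, t):
    """Inter-class discrimination D_it = P_it * Q_t."""
    return P(cls, t) * Q(t)

# "team" occurs only in the sports class, so it discriminates well there.
print(round(D("sports", "team"), 3))  # 1.693
```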
In one embodiment, the method comprises traversing each feature in the text training sample set and calculating, for each feature, the intra-class contribution S_it of each category to the current feature; the specific calculation process comprises the following steps:
if a feature item appears uniformly across all documents of a certain class, it is given a higher weight; the intra-class contribution S_it is calculated as:
[Formula image not reproduced in the source: S_it is computed from n_idt, the word frequency of feature t in the d-th document of category i, and N_i, the number of documents in category i.]
In an embodiment of the present invention, calculating the feature value f_it under each category with an objective function and extracting the top m representative category abstract features specifically comprises:
f_it = D_it * S_it
where D_it is the inter-class discrimination and S_it is the intra-class contribution;
according to f_it, the m largest values are taken in descending order to obtain the category abstract feature set S_l = (s_1, …, s_m).
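The objective function and top-m selection can be sketched as follows; the D_it and S_it values are illustrative precomputed numbers, not from the patent:

```python
# Toy precomputed inter-class discrimination D and intra-class contribution S
# for one class "c1"; keys are (class, feature) pairs, values are illustrative.
D = {("c1", "alpha"): 0.9, ("c1", "beta"): 0.4, ("c1", "gamma"): 0.7}
S = {("c1", "alpha"): 0.8, ("c1", "beta"): 0.9, ("c1", "gamma"): 0.2}

m = 2
f = {key: D[key] * S[key] for key in D}  # objective: f_it = D_it * S_it

# Take the m largest f_it values in descending order -> class summary features
top = sorted((k for k in f if k[0] == "c1"),
             key=lambda k: f[k], reverse=True)[:m]
S_l = [feature for _, feature in top]
print(S_l)  # ['alpha', 'beta']
```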
In an embodiment, before extracting the category abstract features of the constructed text training samples, the following steps are performed:
constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}; performing word segmentation and stop-word removal on the text and converting it into a vector space model.
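A minimal sketch of this preprocessing step, with whitespace splitting standing in for real Chinese word segmentation and an illustrative stop-word list:

```python
from collections import Counter

# Illustrative stop-word list; a real system would use a full list for the
# target language.
STOP_WORDS = {"the", "a", "of"}

def to_vector(text):
    """Tokenize, drop stop words, and return a term-frequency vector
    (the vector-space-model representation of one document)."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(tokens)

vec = to_vector("the score of the match")
print(vec)  # Counter({'score': 1, 'match': 1})
```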
In an embodiment, after extracting the category abstract features of the constructed text training samples, the following steps are performed:
the test sample is first segmented into words and stop words are removed; the test sample is converted into a vector space model according to a feature weighting function, and the top m features are extracted to represent the test text, giving the vector F_w = (f_1, …, f_m); the similarity between the test text and each category is then calculated, and the category with the highest similarity is taken as the category of the test text;
and updating the text training sample library.
Recalculating the similarity between the test text and each category, where the category with the highest similarity is the category to which the test text belongs, specifically comprises:

sim(S_l, F_w) = (S_l · F_w) / (|S_l| · |F_w|)

The test text vector is compared in turn with each category abstract feature vector by the sim(S_l, F_w) operation; the largest sim(S_l, F_w) similarity identifies the class to which the test text belongs.
It is another object of the present invention to provide a text classification system based on improved TF-IDF for implementing the text classification method based on improved TF-IDF, the system comprising:
a text training sample construction module for constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}, performing word segmentation and stop-word removal on the text, and converting it into a vector space model;
a category abstract feature extraction module for extracting category abstract features from the text training samples;
a test sample module for first performing word segmentation and stop-word removal on the test sample, converting it into a vector space model according to a feature weighting function, and extracting the top m features to represent the test text as the vector F_w = (f_1, …, f_m), then calculating the similarity between the test text and each category and taking the category with the highest similarity as the category of the test text;
and an updating module for updating the text training sample library.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the text classification method based on improved TF-IDF.
The technical scheme provided by the embodiments of the present invention has the following beneficial effects: the invention comprehensively considers the inter-class discrimination and intra-class contribution of features, and the obtained feature values can represent both the category features and the current document features. The classification results obtained by the method agree more closely with empirical values and have higher accuracy. The algorithm of the invention is fast in operation and processing, simple and convenient, and yields efficient classification results. The invention improves the efficiency and accuracy of subsequent information retrieval.
The effects and advantages demonstrated by experimental data in comparison with the prior art are as follows: it can be seen from Fig. 4 that the improved TF-IDF text classification method of the present invention improves the text classification effect about as much as the feature-weighted text classification method, the micro-average value of the feature-weighted method being 0.7998 and that of the improved TF-IDF method being 0.7982. The method of the invention scores higher than the traditional TF-IDF method and the weighting method that considers only the class factor.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as disclosed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart of a text classification method based on improved TF-IDF provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of a text classification method based on improved TF-IDF according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an improved TF-IDF text classification system according to an embodiment of the present invention.
In the figure: 1. a text training sample construction module; 2. a category abstract feature extraction module; 3. a test sample module; 4. and updating the module.
FIG. 4 is a histogram, provided by an embodiment of the present invention, of the micro-average F1 performance of three feature weighting methods and the classification method of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
As shown in fig. 1, the improved TF-IDF text classification method according to the embodiment of the present disclosure includes:
S101: constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}; performing word segmentation and stop-word removal on the text and converting it into a vector space model.
S102: extracting the category abstract features of the text training samples.
S103: the test sample is firstly subjected to word segmentation, words are removed, the test sample is converted into a vector space model according to a characteristic weighting function, the first m characteristics are extracted to represent a test text, and the vector space modelIs Fw=(f1,…,fm) And then calculating the similarity degree between the text and each type, and taking the type with the highest similarity as the test text.
S104: and updating the text training sample library.
Fig. 2 is a schematic diagram of the improved TF-IDF text classification method principle according to the embodiment of the present invention.
In a preferred embodiment of the present invention, step S102 specifically includes:
step 2.1: traversing each category in the text training sample set, and counting the document number N of each categoryi
Step 2.2: traversing each feature in the text training sample set, and counting the number n of documents containing current features in each categoryit
Step 2.3: traversing each feature in the text training sample set, and calculating the inter-class distinction degree D of each class for the current feature according to each featureitThe specific calculation process is as follows: :
if a feature item appears in a certain class with high frequency and rarely appears in other classes, the feature item has strong class distinguishing capability and should be given higher weight. Namely, the calculation process of the inter-class distinction degree is as follows:
Figure BDA0003427457910000061
above formula PitThe probability value for the average occurrence of feature t in category i,
Figure BDA0003427457910000062
is the n-th in the category iitThe word frequency of the feature t is contained in the document,
Figure BDA0003427457910000063
is the n-th in the category iitMaximum value of feature word frequency, n, in documentitAs illustrated in step 2.2;
Figure BDA0003427457910000064
above formula QtThe word frequency of the characteristic t appearing between different classes, L is the number of sample class types, nitAs illustrated in step 2.2, NiAs illustrated in step 2.1;
inter-class distinction degree DitThe formula is as follows:
Dit=Pit*Qt
Pitas above, the average occurrence probability value, Q, in the class i for the feature ttAs in the above equation, the word frequency that occurs between different classes for feature t.
Step 2.4: traversing each feature in the text training sample set and calculating, for each feature, the intra-class contribution S_it of each category to the current feature; the specific calculation process is as follows:
if a feature item appears uniformly across all documents of a certain class, the feature item is strongly representative of that class and should be given a higher weight. The intra-class contribution S_it is calculated as:
[Formula image not reproduced in the source: S_it is computed from n_idt, the word frequency of feature t in the d-th document of category i, and N_i (step 2.1).]
Step 2.5: calculating the feature value f_it under each category with an objective function and extracting the top m representative category abstract features; the specific calculation process is as follows:
f_it = D_it * S_it
where D_it is the inter-class discrimination and S_it is the intra-class contribution.
According to f_it, the m largest values are taken in descending order to obtain the category abstract feature set S_l = (s_1, …, s_m).
In a preferred embodiment of the present invention, step S103 is as follows: the test sample is first segmented into words and stop words are removed; the test sample is converted into a vector space model according to a feature weighting function, and the top m features are extracted to represent the test text, giving the test-text vector F_w = (f_1, …, f_m); the similarity between the test text and each category is then calculated, and the category with the highest similarity is taken; the specific calculation process is as follows:

sim(S_l, F_w) = (S_l · F_w) / (|S_l| · |F_w|)

The test text vector is compared in turn with each category abstract feature vector by the sim(S_l, F_w) operation; the largest sim(S_l, F_w) similarity identifies the class to which the test text belongs.
In a preferred embodiment of the present invention, step S104 updates the text training sample library.
The data set of the present invention is a corpus of news report data provided by the Reuters news agency. A corpus subset formed from four categories is adopted; the numbers of documents contained in the four categories are 1620, 1738, 2765 and 2453 respectively. The data set contains a total of 8576 documents and 26745 features.
As shown in fig. 3, the improved TF-IDF text classification system provided by the embodiment of the present invention includes:
a text training sample construction module 1 for constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}, performing word segmentation and stop-word removal on the text, and converting it into a vector space model;
and the category abstract feature extraction module 2 is used for extracting category abstract features of the text training sample.
a test sample module 3 for performing word segmentation and stop-word removal on the test sample, converting it into a vector space model F_w = (f_1, …, f_m) according to a feature weighting function and extracting the top m features to represent the test text, then calculating the similarity between the test text and each category and taking the category with the highest similarity as the category of the test text;
And the updating module 4 is used for updating the text training sample library.
The positive effects of the present invention are further described below in conjunction with specific experimental data.
As shown in Fig. 4, classifiers using the traditional TF-IDF method, the weighting method that considers only the class factor, the feature weighting method, and the text classification method of the present invention based on improved TF-IDF were run on the above data set and their micro-average F1 values compared: the x-axis represents the text classification method and the y-axis the corresponding micro-average F1 value.
Accuracy and recall can only measure the local classification performance of a classifier on a single class. When evaluating the global classification performance of a classifier, the micro-average F1 value and the macro-average are typically used.
Micro-averaging:

F1 = 2 * H * G / (H + G)

where H and G denote the micro-averaged accuracy and recall, respectively:

H = Σ_l TP_l / Σ_l (TP_l + FP_l)
G = Σ_l TP_l / Σ_l (TP_l + FN_l)

where TP_l is the number of documents correctly assigned to class l, FP_l is the number of documents assigned to class l that do not belong to it, and FN_l is the number of documents belonging to class l that were not assigned to it.
The improved TF-IDF text classification method improves the text classification effect about as much as the feature-weighted text classification method: the micro-average value of the feature-weighted method is 0.7998 and that of the improved TF-IDF method is 0.7982. The method of the invention scores higher than the traditional TF-IDF method and the weighting method that considers only the class factor.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure should be limited only by the attached claims.

Claims (10)

1. An improved TF-IDF text classification method, characterized in that the improved TF-IDF text classification method comprises:
traversing each category in the text training sample set, and counting the document number N of each categoryi
Traversing each feature in the text training sample set, and counting the number n of documents containing current features in each categoryit
Traversing each feature in the text training sample set, and calculating the inter-class distinction degree D of each class aiming at the current feature for each featureit
Traversing each feature in the text training sample set, and calculating the intra-class contribution degree S of each class to the current feature for each featureit
Calculating the feature f under each category by using an objective functionitAnd extracting the top m representative category abstract features.
2. The improved TF-IDF based text classification method according to claim 1, characterized in that traversing each feature in the text training sample set and calculating, for each feature, the inter-class discrimination D_it of each category for the current feature specifically comprises:
if a feature item appears with high frequency in a certain class, it is given a high weight; the inter-class discrimination is calculated as:

P_it = (1/n_it) * Σ_{d=1}^{n_it} (n_idt / maxtf_id)

where P_it is the average occurrence probability value of feature t in category i, n_idt is the word frequency of feature t in the d-th of the n_it documents of category i that contain t, and maxtf_id is the maximum feature word frequency in that document;

[Formula image not reproduced in the source: Q_t, the word frequency with which feature t appears among the different classes, is computed from n_it, N_i and the number of sample class types L.]
3. The improved TF-IDF based text classification method according to claim 2, characterized in that the inter-class discrimination D_it is:
D_it = P_it * Q_t
where P_it is the average occurrence probability value of feature t in category i and Q_t is the word frequency with which feature t appears among the different classes.
4. The improved TF-IDF based text classification method according to claim 1, characterized in that traversing each feature in the text training sample set and calculating the intra-class contribution S_it of each category to the current feature specifically comprises:
if a feature item appears uniformly across all documents of a certain class, it is given a higher weight; the intra-class contribution S_it is calculated as:
[Formula image not reproduced in the source: S_it is computed from n_idt, the word frequency of feature t in the d-th document of category i.]
5. The improved TF-IDF based text classification method according to claim 1, characterized in that calculating the feature value f_it under each category with an objective function and extracting the top m representative category abstract features specifically comprises:
f_it = D_it * S_it
where D_it is the inter-class discrimination and S_it is the intra-class contribution;
according to f_it, the m largest values are taken in descending order to obtain the category abstract feature set S_l = (s_1, …, s_m).
6. The improved TF-IDF based text classification method according to claim 1, characterized in that before extracting the category abstract features of the constructed text training samples, the following steps are performed:
constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}; performing word segmentation and stop-word removal on the text and converting it into a vector space model.
7. The improved TF-IDF based text classification method according to claim 1, characterized in that after extracting the category abstract features of the constructed text training samples, the method further comprises:
segmenting the test sample into words and removing stop words; converting the test sample into a vector space model F_w = (f_1, …, f_m) according to a feature weighting function and extracting the top m features to represent the test text; then calculating the similarity between the test text and each category to obtain the category of the test text;
and updating the text training sample library.
8. The improved TF-IDF based text classification method according to claim 7, characterized in that recalculating the similarity between the test text and each category, where the category with the highest similarity is the category to which the test text belongs, specifically comprises:

sim(S_l, F_w) = (S_l · F_w) / (|S_l| · |F_w|)

the test text vector is compared in turn with each category abstract feature vector by the sim(S_l, F_w) operation; the largest sim(S_l, F_w) similarity identifies the class to which the test text belongs.
9. An improved TF-IDF text classification system implementing the improved TF-IDF text classification method according to any one of claims 1 to 7, characterized in that the improved TF-IDF text classification system comprises:
a text training sample construction module for constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}, performing word segmentation and stop-word removal on the text, and converting it into a vector space model;
a category abstract feature extraction module for extracting category abstract features from the text training samples;
a test sample module for performing word segmentation and stop-word removal on the test sample, converting it into a vector space model F_w = (f_1, …, f_m) according to a feature weighting function and extracting the top m features to represent the test text, then calculating the similarity between the test text and each category to obtain the category of the test text;
and an updating module for updating the text training sample library.
10. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the improved TF-IDF text classification method according to any one of claims 1 to 7.
CN202111584594.1A 2021-12-22 2021-12-22 Text classification method, system and computer equipment based on improved TF-IDF Pending CN114282525A (en)


Publications (1)

Publication Number Publication Date
CN114282525A true CN114282525A (en) 2022-04-05

Family

ID=80874064



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination