CN114282525A - Text classification method, system and computer equipment based on improved TF-IDF - Google Patents
- Publication number
- CN114282525A (application CN202111584594.1A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- class
- category
- idf
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text classification method, system, and computer device based on improved TF-IDF, in the technical field of semantic networks. A text training sample set and a category set are constructed, and class abstract features are extracted from the text training samples. A test sample is first segmented into words and stop words are removed; it is then converted into a vector space model according to a feature weighting function, and likewise the top m features are extracted to represent the test text. The similarity between the text and each class is computed, and the class with the highest similarity is taken as the class of the test text. Finally, the text training sample library is updated. The invention jointly considers the inter-class discrimination and the intra-class contribution of each feature, so the resulting feature values represent both the class features and the current document features. The classification results better match empirical values and achieve higher accuracy, and the algorithm is fast, simple, and efficient.
Description
Technical Field
The invention belongs to the technical field of semantic networks, and particularly relates to a text classification method, system, and computer device based on improved TF-IDF.
Background
At present, with the development and popularization of the internet, the demand of users for various digital information is increasing. Meanwhile, ways to acquire digital information are increasing.
However, the quality of the information obtained is not uniform, which causes many difficulties for the user to process the information. By means of the automatic processing technology, data can be conveniently organized and managed, a large amount of processing time can be saved, and more importantly, errors generated in the manual processing process can be avoided. Automatic text classification techniques have been widely used in many areas of social life as an effective means of handling large amounts of text information, and have achieved satisfactory results. The TF-IDF method is a commonly used feature item weight calculation method in the text vectorization process, and measures the importance of feature items in the whole document set.
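As a concrete reference point, the standard TF-IDF weighting that the background describes can be sketched as follows (a minimal illustration of the conventional formula, not code from the patent):

```python
import math

def tf_idf(term_counts_per_doc):
    """Standard TF-IDF: tf = count / document length, idf = log(N / df).

    term_counts_per_doc: list of dicts mapping term -> raw count in that document.
    Returns a list of dicts mapping term -> TF-IDF weight.
    """
    n_docs = len(term_counts_per_doc)
    # document frequency: number of documents containing each term
    df = {}
    for counts in term_counts_per_doc:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    weights = []
    for counts in term_counts_per_doc:
        total = sum(counts.values())
        weights.append({term: (count / total) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return weights

docs = [{"cat": 2, "dog": 1}, {"cat": 1}, {"bird": 3}]
w = tf_idf(docs)
# "cat" occurs in 2 of 3 documents, so its idf is log(3/2); a term present
# in every document would receive weight 0.
```

Note that this conventional weight is computed over the whole document set and carries no information about class membership, which is the limitation the invention addresses.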
To address the problem that the TF-IDF method cannot reflect a feature item's ability to discriminate between categories, nor how representative the feature item is of a category, the invention provides an improved TF-IDF text classification method based on document categories and on the inter-class discrimination and intra-class contribution of feature items.
Through the above analysis, the problems and defects of the prior art are as follows: (1) the TF-IDF method used in prior-art text classification cannot reflect a feature item's ability to discriminate between categories or its representativeness of a category, so the resulting feature values represent neither class features nor current-document features; (2) the classification results obtained by the prior art do not match empirical values, and accuracy is low; (3) the prior art is inefficient in producing classification results; (4) the prior art cannot help improve the efficiency and accuracy of subsequent information retrieval.
Disclosure of Invention
In order to overcome the problems in the related art, the disclosed embodiments of the present invention provide a method, system and computer device for text classification based on improved TF-IDF.
The technical scheme is as follows: a text classification method based on improved TF-IDF extracts class abstract features from constructed text training samples; the extraction comprises the following steps:
traversing each category in the text training sample set and counting the number of documents N_i in each category;
traversing each feature in the text training sample set and counting the number of documents n_it in each category that contain the current feature;
traversing each feature in the text training sample set and, for each feature, calculating the inter-class discrimination D_it of each class for the current feature;
traversing each feature in the text training sample set and, for each feature, calculating the intra-class contribution S_it of each class for the current feature;
calculating the feature score f_it under each category with an objective function and extracting the top m representative class abstract features.
In one embodiment, the method comprises traversing each feature in the text training sample set and, for each feature, calculating the inter-class discrimination D_it of each class for the current feature; the specific calculation comprises the following steps:
if a feature item appears with high frequency in a certain class and with low frequency in the other classes, it is given a high weight; the inter-class discrimination is calculated as follows:
in the formula, P_it is the average occurrence probability of feature t in category i, obtained from the word frequency of feature t in each of the n_it documents of category i that contain it, normalized by the maximum feature word frequency in that document;
in the formula, Q_t is the word frequency with which feature t appears across the different classes, and L is the number of sample classes;
the inter-class discrimination D_it is then:
D_it = P_it * Q_t
where P_it is the average occurrence probability of feature t in category i and Q_t is the word frequency of feature t across the different classes.
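The formula images for P_it and Q_t are not reproduced in this text. One reconstruction consistent with the variable definitions above — an assumption on my part, not the patent's exact formulas — is:

```latex
% Assumed reconstruction; the patent's formula images are missing here.
% P_it: average of the document-normalized frequencies of feature t over
% the n_it documents of category i that contain it.
P_{it} = \frac{1}{n_{it}} \sum_{d=1}^{n_{it}} \frac{n_{idt}}{\max_{t'} n_{idt'}}
% Q_t: one common choice is an inverse-class-frequency term over the L
% classes; the source only says Q_t reflects how t's word frequency is
% spread across classes.
\qquad
Q_t = \log \frac{L}{\bigl|\{\, i : n_{it} > 0 \,\}\bigr|}
% giving the inter-class discrimination stated in the text:
\qquad
D_{it} = P_{it} \cdot Q_t
```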
In one embodiment, the method comprises traversing each feature in the text training sample set and, for each feature, calculating the intra-class contribution S_it of each class for the current feature; the specific calculation comprises the following steps:
if a feature item appears uniformly across all the documents of a certain class, it is given a higher weight; the intra-class contribution S_it is calculated as follows:
in the formula, n_idt is the word frequency of feature t in the d-th document of category i.
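The S_it formula image is likewise missing from this text. One uniformity measure consistent with the description (uniform occurrence across a class's documents yields a high weight) is a normalized entropy; this is an assumed reconstruction, not the patent's exact formula:

```latex
% Assumed reconstruction. p_{idt} is the share of feature t's occurrences
% in category i that fall in document d; uniform shares maximize the
% entropy, so S_{it} is largest when t is spread evenly over the class.
p_{idt} = \frac{n_{idt}}{\sum_{d'=1}^{N_i} n_{id't}},
\qquad
S_{it} = -\frac{1}{\log N_i} \sum_{d=1}^{N_i} p_{idt} \log p_{idt}
```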
In an embodiment of the present invention, the feature score f_it under each category is calculated with an objective function, and the top m representative class abstract features are extracted; the specific calculation comprises the following steps:
f_it = D_it * S_it
where D_it is the inter-class discrimination and S_it is the intra-class contribution;
according to f_it, taking the first m values from large to small yields the class abstract feature set S_l = (s_1, …, s_m).
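The whole scoring pipeline can be sketched as follows. Only f_it = D_it * S_it and the top-m selection come from the source; the concrete D_it and S_it formulas inside are the assumed reconstructions discussed above:

```python
import math
from collections import Counter

def class_features(docs_by_class, m):
    """Rank each class's features by f_it = D_it * S_it and keep the top m.

    docs_by_class maps a class label to a list of Counters (term -> count).
    The D_it and S_it formulas below are assumptions; the source only
    specifies the product f_it = D_it * S_it and the top-m selection.
    """
    L = len(docs_by_class)
    # number of classes in which each term occurs at least once
    class_df = Counter()
    for docs in docs_by_class.values():
        class_df.update(set().union(*docs))
    summaries = {}
    for label, docs in docs_by_class.items():
        N_i = len(docs)
        scores = {}
        for t in set().union(*docs):
            containing = [d for d in docs if t in d]
            # P_it: average document-normalized frequency of t (assumed form)
            P = sum(d[t] / max(d.values()) for d in containing) / len(containing)
            # Q_t: inverse class frequency (assumed form)
            Q = math.log(L / class_df[t]) if class_df[t] < L else 0.0
            # S_it: normalized entropy of t's spread over the class (assumed form)
            total = sum(d[t] for d in containing)
            entropy = -sum((d[t] / total) * math.log(d[t] / total)
                           for d in containing)
            S = entropy / math.log(N_i) if N_i > 1 else 1.0
            scores[t] = (P * Q) * S          # f_it = D_it * S_it
        summaries[label] = sorted(scores, key=scores.get, reverse=True)[:m]
    return summaries

classes = {
    "sports":  [Counter({"goal": 2, "team": 1}), Counter({"goal": 1})],
    "finance": [Counter({"stock": 3})],
}
top = class_features(classes, 1)
# → {"sports": ["goal"], "finance": ["stock"]}
```

In the toy example, "goal" beats "team" in the sports class because it occurs in every sports document (high entropy), while "team" is confined to one document and gets S_it = 0.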
In an embodiment, before extracting the class abstract features of the constructed text training sample, the following steps are performed:
constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}; performing word segmentation on the text and converting the result into a vector space model.
In an embodiment, after extracting the class abstract features of the constructed text training sample, the following steps are performed:
the test sample is first segmented into words and stop words are removed; the test sample is converted into a vector space model according to the feature weighting function, and the top m features are extracted to represent the test text as F_w = (f_1, …, f_m); the similarity between the text and each class is then calculated, and the class with the highest similarity is taken as the class of the test text;
and the text training sample library is updated.
Recalculating the similarity between the test text and each class, the class with the highest similarity is the class to which the test text belongs; the specific calculation comprises the following steps:
the test text vector is compared in turn with each class abstract feature vector by the sim(S_l, F_w) operation; the largest sim(S_l, F_w) similarity identifies the class to which the test text belongs.
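The classification step above can be sketched as follows. The source does not define sim(); cosine similarity is assumed here, and the variable names are mine:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(test_vec, class_vectors):
    """Assign the class whose summary-feature vector is most similar.

    class_vectors: {class_label: {term: weight}} built from the top-m
    class abstract features.
    """
    return max(class_vectors, key=lambda c: cosine(test_vec, class_vectors[c]))

label = classify({"goal": 1.0, "match": 2.0},
                 {"sports": {"goal": 3.0, "match": 1.0},
                  "finance": {"stock": 2.0, "bank": 1.0}})
# → "sports" (no terms shared with "finance", so its similarity is 0)
```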
Another object of the present invention is to provide a text classification system based on improved TF-IDF that implements the above method, the system comprising:
a text training sample construction module for constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}, performing word segmentation on the text, and converting the result into a vector space model;
a class abstract feature extraction module for extracting class abstract features from the text training samples;
a test sample module for first segmenting the test sample into words and removing stop words, converting the test sample into a vector space model according to the feature weighting function, and extracting the top m features to represent the test text as F_w = (f_1, …, f_m), then calculating the similarity between the text and each class and taking the class with the highest similarity as the class of the test text;
and an updating module for updating the text training sample library.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the text classification method based on improved TF-IDF.
The technical scheme provided by the embodiments of the invention has the following beneficial effects: the invention jointly considers the inter-class discrimination and the intra-class contribution of each feature, so the resulting feature values represent both the class features and the current document features. The classification results better match empirical values and achieve higher accuracy. The algorithm is fast, simple, and convenient, and produces classification results efficiently. The invention improves the efficiency and accuracy of subsequent information retrieval.
The effects and advantages relative to the prior art, combined with experimental data, are as follows: as can be seen from Fig. 4, the improved TF-IDF text classification method of the invention improves the text classification effect about as much as the feature-weighted text classification method: the micro-average value of the feature-weighted method is 0.7998, and that of the improved TF-IDF method is 0.7982. The method of the invention scores higher than traditional TF-IDF and higher than the weighting method that considers only the class factor.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as disclosed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart of a text classification method based on improved TF-IDF provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of a text classification method based on improved TF-IDF according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an improved TF-IDF text classification system according to an embodiment of the present invention.
In the figure: 1. a text training sample construction module; 2. a category abstract feature extraction module; 3. a test sample module; 4. and updating the module.
FIG. 4 is a histogram of micro-average performance for the 3 feature weighting methods and the classification method of the present invention, provided by an embodiment of the invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
As shown in fig. 1, the improved TF-IDF text classification method according to the embodiment of the present disclosure includes:
S101: constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}; performing word segmentation on the text and converting the result into a vector space model;
S102: extracting the class abstract features of the text training sample;
S103: segmenting the test sample into words and removing stop words, converting the test sample into a vector space model according to the feature weighting function, and extracting the top m features to represent the test text as F_w = (f_1, …, f_m); then calculating the similarity between the text and each class and taking the class with the highest similarity as the class of the test text;
S104: updating the text training sample library.
Fig. 2 is a schematic diagram of the improved TF-IDF text classification method principle according to the embodiment of the present invention.
In a preferred embodiment of the present invention, step S102 specifically includes:
step 2.1: traversing each category in the text training sample set, and counting the document number N of each categoryi;
Step 2.2: traversing each feature in the text training sample set, and counting the number n of documents containing current features in each categoryit;
Step 2.3: traversing each feature in the text training sample set and, for each feature, calculating the inter-class discrimination D_it of each class for the current feature; the specific calculation process is as follows:
if a feature item appears with high frequency in a certain class and rarely appears in the other classes, the feature item has strong class-discriminating ability and should be given a higher weight. The inter-class discrimination is calculated as follows:
in the above formula, P_it is the average occurrence probability of feature t in category i, obtained from the word frequency of feature t in each of the n_it documents of category i that contain it, normalized by the maximum feature word frequency in that document; n_it is as defined in step 2.2;
in the above formula, Q_t is the word frequency with which feature t appears across the different classes, L is the number of sample classes, n_it is as defined in step 2.2, and N_i is as defined in step 2.1;
the inter-class discrimination D_it is then:
D_it = P_it * Q_t
where P_it is, as above, the average occurrence probability of feature t in category i, and Q_t is, as above, the word frequency of feature t across the different classes.
Step 2.4: traversing each feature in the text training sample set and, for each feature, calculating the intra-class contribution S_it of each class for the current feature; the specific calculation process is as follows:
if a feature item appears uniformly across all the documents of a certain class, the feature item is strongly representative of that class and should be given a higher weight. The intra-class contribution S_it is calculated as follows:
in the above formula, n_idt is the word frequency of feature t in the d-th document of category i, and N_i is as defined in step 2.1.
Step 2.5: calculating the feature f under each category by using an objective functionitExtracting the abstract features of the first m representative categories, wherein the concrete calculation process is as follows:
fit=Dit*Sit
above formula DitFor inter-class distinction, SitIs the degree of contribution within the class.
According to fitTaking the first m values from large to small to obtain a category abstract feature set S1=(s1,…,sm);
In a preferred embodiment of the present invention, step S103: the test sample is first segmented into words and stop words are removed; the test sample is converted into a vector space model according to the feature weighting function, and the top m features are extracted to represent the test text as the vector F_w = (f_1, …, f_m); the similarity between the text and each class is calculated, and the class with the highest similarity is taken; the specific calculation process is as follows:
the test text vector is compared in turn with each class abstract feature vector by the sim(S_l, F_w) operation; the largest sim(S_l, F_w) similarity identifies the class to which the test text belongs.
In a preferred embodiment of the present invention, step S104 updates the text training sample library.
The data set of the invention is a corpus of news report data provided by the Reuters news agency. A corpus subset formed from four categories is used; the four categories contain 1620, 1738, 2765, and 2453 documents respectively, for a total of 8576 documents and 26745 features.
As shown in fig. 3, the improved TF-IDF text classification system provided by the embodiment of the present invention includes:
a text training sample construction module 1, configured to construct a text training sample D ═ D1,…,dNC, a category set is C ═ C1,…,cL}; performing word segmentation on the text, and converting the word operation into a vector space model;
and the category abstract feature extraction module 2 is used for extracting category abstract features of the text training sample.
A test sample module 3 for performing word segmentation to stop words, converting the test sample into a vector space model according to the characteristic weighting function, extracting the first m characteristics to represent the test text, wherein the vector space model is Fw=(f1,…,fm) Then, the similarity degree between the text and each type is calculated,and taking the type with the highest similarity as the test text.
And the updating module 4 is used for updating the text training sample library.
The positive effects of the present invention are further described below in conjunction with specific experimental data.
As shown in FIG. 4, experiments were run on the above data set with classifiers using traditional TF-IDF, the weighting that considers only the class factor, the feature weighting method, and the improved TF-IDF text classification method of the present invention, and the micro-average F1 values were compared: the x-axis represents the text classification method and the y-axis the corresponding micro-average F1 value;
the local classification performance of the classifier on a single class can only be measured due to accuracy and recall. When evaluating the global classification performance of a classifier, a micro-average value F is typically usedIAnd a macro average.
Micro-averaging:
the above formula H, G indicates the accuracy and recall, respectively;
upper type TNlNumber of classes l, NNC′For a number that does not belong to the class l,the number of the non-classified items belonging to the classified item l.
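The pooled micro-average F1 described above can be sketched as follows (the per-class counts are summed before computing precision H and recall G; the variable names are mine, not from the patent):

```python
def micro_f1(confusion):
    """Micro-averaged F1 over L classes.

    confusion: list of (tp, fp, fn) tuples, one per class.
    Micro-averaging pools the counts before computing precision/recall:
    H = sum(tp) / sum(tp + fp), G = sum(tp) / sum(tp + fn),
    F1 = 2 * H * G / (H + G).
    """
    tp = sum(c[0] for c in confusion)
    fp = sum(c[1] for c in confusion)
    fn = sum(c[2] for c in confusion)
    h = tp / (tp + fp)   # micro-averaged precision
    g = tp / (tp + fn)   # micro-averaged recall
    return 2 * h * g / (h + g)

# Two classes with (tp, fp, fn) counts; pooled micro-F1 = 2*13/(2*13+3+4)
score = micro_f1([(8, 2, 1), (5, 1, 3)])
# → 26/33 ≈ 0.7879
```

Unlike macro-averaging, this weights every document equally, so large classes dominate the score.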
The improved TF-IDF text classification method improves the text classification effect about as much as the feature-weighted text classification method: the micro-average value of the feature-weighted method is 0.7998 and that of the improved TF-IDF method is 0.7982. The method of the invention scores higher than traditional TF-IDF and higher than the weighting method that considers only the class factor.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure should be limited only by the attached claims.
Claims (10)
1. A text classification method based on improved TF-IDF, characterized in that the method comprises:
traversing each category in the text training sample set and counting the number of documents N_i in each category;
traversing each feature in the text training sample set and counting the number of documents n_it in each category that contain the current feature;
traversing each feature in the text training sample set and, for each feature, calculating the inter-class discrimination D_it of each class for the current feature;
traversing each feature in the text training sample set and, for each feature, calculating the intra-class contribution S_it of each class for the current feature;
calculating the feature score f_it under each category with an objective function and extracting the top m representative class abstract features.
2. The text classification method based on improved TF-IDF according to claim 1, characterized in that traversing each feature in the text training sample set and calculating, for each feature, the inter-class discrimination D_it of each class for the current feature specifically comprises:
if a feature item appears with high frequency in a certain class and with low frequency in the other classes, it is given a high weight; the inter-class discrimination is calculated as follows:
in the formula, P_it is the average occurrence probability of feature t in category i, obtained from the word frequency of feature t in each of the n_it documents of category i that contain it, normalized by the maximum feature word frequency in that document;
in the formula, Q_t is the word frequency with which feature t appears across the different classes, and L is the number of sample classes.
3. The text classification method based on improved TF-IDF according to claim 2, characterized in that the inter-class discrimination D_it is:
D_it = P_it * Q_t
where P_it is the average occurrence probability of feature t in category i and Q_t is the word frequency of feature t across the different classes.
4. The text classification method based on improved TF-IDF according to claim 1, characterized in that traversing each feature in the text training sample set and calculating the intra-class contribution S_it of each class for the current feature specifically comprises:
if a feature item appears uniformly across all the documents of a certain class, it is given a higher weight; the intra-class contribution S_it is calculated as follows:
in the formula, n_idt is the word frequency of feature t in the d-th document of category i.
5. The text classification method based on improved TF-IDF according to claim 1, characterized in that the feature score f_it under each category is calculated with an objective function and the top m representative class abstract features are extracted; the specific calculation comprises the following steps:
f_it = D_it * S_it
where D_it is the inter-class discrimination and S_it is the intra-class contribution;
according to f_it, taking the first m values from large to small yields the class abstract feature set S_l = (s_1, …, s_m).
6. The text classification method based on improved TF-IDF according to claim 1, characterized in that before extracting the class abstract features of the constructed text training sample, the following steps are performed:
constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}; performing word segmentation on the text and converting the result into a vector space model.
7. The text classification method based on improved TF-IDF according to claim 1, characterized in that after extracting the class abstract features of the constructed text training sample, the method further comprises:
segmenting the test sample into words and removing stop words, converting the test sample into a vector space model according to the feature weighting function, and extracting the top m features to represent the test text as F_w = (f_1, …, f_m); then calculating the similarity between the text and each class to obtain the class of the test text;
and updating the text training sample library.
8. The text classification method based on improved TF-IDF according to claim 7, characterized in that
recalculating the similarity between the test text and each class, the class with the highest similarity is the class to which the test text belongs; the specific calculation comprises the following steps:
the test text vector is compared in turn with each class abstract feature vector by the sim(S_l, F_w) operation; the largest sim(S_l, F_w) similarity identifies the class to which the test text belongs.
9. A text classification system based on improved TF-IDF implementing the text classification method according to any one of claims 1 to 7, characterized in that the system comprises:
a text training sample construction module for constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}, performing word segmentation on the text, and converting the result into a vector space model;
a class abstract feature extraction module for extracting class abstract features from the text training samples;
a test sample module for segmenting the test sample into words and removing stop words, converting the test sample into a vector space model according to the feature weighting function, and extracting the top m features to represent the test text as F_w = (f_1, …, f_m), then calculating the similarity between the text and each class to obtain the class of the test text;
and an updating module for updating the text training sample library.
10. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the text classification method based on improved TF-IDF according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111584594.1A CN114282525A (en) | 2021-12-22 | 2021-12-22 | Text classification method, system and computer equipment based on improved TF-IDF |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114282525A true CN114282525A (en) | 2022-04-05 |
Family
ID=80874064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111584594.1A Pending CN114282525A (en) | 2021-12-22 | 2021-12-22 | Text classification method, system and computer equipment based on improved TF-IDF |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114282525A (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |