CN114282525A - Text classification method, system and computer equipment based on improved TF-IDF - Google Patents

Text classification method, system and computer equipment based on improved TF-IDF

Info

Publication number
CN114282525A
CN114282525A
Authority
CN
China
Prior art keywords
text
feature
class
category
idf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111584594.1A
Other languages
Chinese (zh)
Inventor
金平艳
石珺
李志鹏
廖勇
杨阳朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wanglian Anrui Network Technology Co ltd
Original Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wanglian Anrui Network Technology Co ltd filed Critical Shenzhen Wanglian Anrui Network Technology Co ltd
Priority to CN202111584594.1A priority Critical patent/CN114282525A/en
Publication of CN114282525A publication Critical patent/CN114282525A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text classification method, system and computer equipment based on improved TF-IDF, and relates to the technical field of semantic networks. A text training sample set and a category set are constructed, and category abstract features are extracted from the text training samples. A test sample is first segmented into words and stop words are removed; it is then converted into a vector space model according to a feature weighting function, and the top m features are likewise extracted to represent the test text. The similarity between the test text and each category is calculated, and the category with the highest similarity is taken as the category of the test text. Finally, the text training sample library is updated. The invention comprehensively considers the inter-class discrimination and intra-class contribution of features, so the obtained feature values can represent both the category features and the current document features. The classification results obtained by the method agree more closely with empirical values and have higher accuracy. The algorithm of the invention is fast in operation and processing, simple and convenient, and yields efficient classification results.

Description

Text classification method, system and computer equipment based on improved TF-IDF
Technical Field
The invention belongs to the technical field of semantic networks, and particularly relates to an improved TF-IDF text classification method, an improved TF-IDF text classification system and computer equipment.
Background
At present, with the development and popularization of the internet, the demand of users for various digital information is increasing. Meanwhile, ways to acquire digital information are increasing.
However, the quality of the information obtained is not uniform, which causes many difficulties for the user to process the information. By means of the automatic processing technology, data can be conveniently organized and managed, a large amount of processing time can be saved, and more importantly, errors generated in the manual processing process can be avoided. Automatic text classification techniques have been widely used in many areas of social life as an effective means of handling large amounts of text information, and have achieved satisfactory results. The TF-IDF method is a commonly used feature item weight calculation method in the text vectorization process, and measures the importance of feature items in the whole document set.
Aiming at the problem that the TF-IDF method cannot reflect the distinguishing capability of the feature items on the categories and the representativeness of the feature items on the categories in the text classification process, the invention provides an improved TF-IDF text classification method based on the document categories and the inter-category distinguishing degree and the intra-category contribution degree of the feature items.
Through the above analysis, the problems and defects of the prior art are as follows: (1) the TF-IDF method used in prior-art text classification cannot reflect the ability of feature items to distinguish between categories or their representativeness of a category, so the resulting feature values cannot represent category features and current document features; (2) the classification results obtained by the prior art do not agree with empirical values, and the accuracy is low; (3) the prior art is inefficient in obtaining classification results; (4) the prior art cannot help improve the efficiency and accuracy of subsequent information retrieval.
Disclosure of Invention
In order to overcome the problems in the related art, the disclosed embodiments of the present invention provide a method, system and computer device for text classification based on improved TF-IDF.
The technical scheme is as follows: a text classification method based on improved TF-IDF includes extracting category abstract features from the constructed text training samples; the extraction comprises the following steps:
traversing each category in the text training sample set and counting the number of documents N_i in each category;
traversing each feature in the text training sample set and counting, for each category, the number of documents n_it that contain the current feature;
traversing each feature in the text training sample set and calculating, for each feature, the inter-class discrimination D_it of each category for the current feature;
traversing each feature in the text training sample set and calculating, for each feature, the intra-class contribution S_it of each category to the current feature;
calculating the feature value f_it under each category with an objective function, and extracting the top m representative category abstract features.
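As a minimal illustration, the two counting steps above (N_i and n_it) can be sketched in Python; the toy corpus, category names and tokens below are hypothetical and not from the patent:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each training document is (category, token list).
corpus = [
    ("sports", ["match", "team", "score", "team"]),
    ("sports", ["coach", "team", "win"]),
    ("finance", ["stock", "market", "score"]),
]

# N_i: number of documents in each category i
N = Counter(cat for cat, _ in corpus)

# n_it: number of documents in category i that contain feature t
n = defaultdict(Counter)
for cat, tokens in corpus:
    for t in set(tokens):  # set(): count each document at most once per feature
        n[cat][t] += 1

print(N["sports"])           # 2
print(n["sports"]["team"])   # 2
print(n["finance"]["score"]) # 1
```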
In one embodiment, the method comprises traversing each feature in the text training sample set and calculating, for each feature, the inter-class discrimination D_it of each category for the current feature; the specific calculation process comprises the following steps:
if a feature item appears with high frequency in a certain class and with low frequency in the other classes, it is given a high weight; the inter-class discrimination is calculated as:

P_it = (1/n_it) * Σ_{d=1}^{n_it} (n_idt / maxtf_id)

where P_it is the average occurrence probability value of feature t in category i, n_idt is the word frequency of feature t in the d-th of the n_it documents of category i that contain t, and maxtf_id is the maximum feature word frequency in that document;

[Formula image not reproduced in the source: Q_t, the word frequency with which feature t appears among the different classes, is computed from n_it, N_i and the number of sample class types L.]

The inter-class discrimination D_it is then:
D_it = P_it * Q_t
where P_it is the average occurrence probability value of feature t in category i and Q_t is the word frequency with which feature t appears among the different classes.
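A minimal Python sketch of the inter-class discrimination D_it = P_it * Q_t follows. P_it implements the verbal description above (average normalized term frequency over the documents of the class that contain t); the formula for Q_t is an unreproduced image in the source, so an IDF-style cross-class factor stands in for it here as a labeled assumption, and all data are illustrative:

```python
import math
from collections import Counter

# Toy documents grouped by class; each Counter maps feature -> word frequency.
docs_by_class = {
    "sports":  [Counter(team=3, win=1), Counter(team=2, coach=2)],
    "finance": [Counter(stock=4, market=1)],
}
L = len(docs_by_class)  # number of sample class types

def P(cls, t):
    """P_it: average normalized frequency of feature t over the n_it
    documents of class cls that contain t (one reading of the patent's
    verbal description)."""
    docs = [d for d in docs_by_class[cls] if t in d]
    if not docs:
        return 0.0
    return sum(d[t] / max(d.values()) for d in docs) / len(docs)

def Q(t):
    """Q_t: cross-class factor. ASSUMPTION: the patent's formula is an
    unreproduced image, so an IDF-style penalty over the classes that
    contain t is used in its place."""
    classes_with_t = sum(1 for docs in docs_by_class.values()
                         if any(t in d for d in docs))
    return math.log(L / classes_with_t) + 1.0  # +1 keeps the factor positive

def D(cls, t):
    """Inter-class discrimination D_it = P_it * Q_t."""
    return P(cls, t) * Q(t)

# "team" occurs only in the sports class, so it discriminates well there.
print(round(D("sports", "team"), 3))  # 1.693
```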
In one embodiment, the method comprises traversing each feature in the text training sample set and calculating, for each feature, the intra-class contribution S_it of each category to the current feature; the specific calculation process comprises the following steps:
if a feature item appears uniformly across all documents of a certain class, it is given a higher weight; the intra-class contribution S_it is calculated as:
[Formula image not reproduced in the source: S_it is computed from n_idt, the word frequency of feature t in the d-th document of category i, and N_i, the number of documents in category i.]
In an embodiment of the present invention, calculating the feature value f_it under each category with an objective function and extracting the top m representative category abstract features specifically comprises:
f_it = D_it * S_it
where D_it is the inter-class discrimination and S_it is the intra-class contribution;
according to f_it, the m largest values are taken in descending order to obtain the category abstract feature set S_l = (s_1, …, s_m).
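The objective function and top-m selection can be sketched as follows; the D_it and S_it values are illustrative precomputed numbers, not from the patent:

```python
# Toy precomputed inter-class discrimination D and intra-class contribution S
# for one class "c1"; keys are (class, feature) pairs, values are illustrative.
D = {("c1", "alpha"): 0.9, ("c1", "beta"): 0.4, ("c1", "gamma"): 0.7}
S = {("c1", "alpha"): 0.8, ("c1", "beta"): 0.9, ("c1", "gamma"): 0.2}

m = 2
f = {key: D[key] * S[key] for key in D}  # objective: f_it = D_it * S_it

# Take the m largest f_it values in descending order -> class summary features
top = sorted((k for k in f if k[0] == "c1"),
             key=lambda k: f[k], reverse=True)[:m]
S_l = [feature for _, feature in top]
print(S_l)  # ['alpha', 'beta']
```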
In an embodiment, before extracting the category abstract features of the constructed text training samples, the following steps are performed:
constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}; performing word segmentation and stop-word removal on the text and converting it into a vector space model.
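A minimal sketch of this preprocessing step, with whitespace splitting standing in for real Chinese word segmentation and an illustrative stop-word list:

```python
from collections import Counter

# Illustrative stop-word list; a real system would use a full list for the
# target language.
STOP_WORDS = {"the", "a", "of"}

def to_vector(text):
    """Tokenize, drop stop words, and return a term-frequency vector
    (the vector-space-model representation of one document)."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(tokens)

vec = to_vector("the score of the match")
print(vec)  # Counter({'score': 1, 'match': 1})
```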
In an embodiment, after extracting the category abstract features of the constructed text training samples, the following steps are performed:
the test sample is first segmented into words and stop words are removed; the test sample is converted into a vector space model according to a feature weighting function, and the top m features are extracted to represent the test text, giving the vector F_w = (f_1, …, f_m); the similarity between the test text and each category is then calculated, and the category with the highest similarity is taken as the category of the test text;
and updating the text training sample library.
Recalculating the similarity between the test text and each category, where the category with the highest similarity is the category to which the test text belongs, specifically comprises:

sim(S_l, F_w) = (S_l · F_w) / (|S_l| · |F_w|)

The test text vector is compared in turn with each category abstract feature vector by the sim(S_l, F_w) operation; the largest sim(S_l, F_w) similarity identifies the class to which the test text belongs.
It is another object of the present invention to provide a text classification system based on improved TF-IDF for implementing the text classification method based on improved TF-IDF, the system comprising:
a text training sample construction module for constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}, performing word segmentation and stop-word removal on the text, and converting it into a vector space model;
a category abstract feature extraction module for extracting category abstract features from the text training samples;
a test sample module for first performing word segmentation and stop-word removal on the test sample, converting it into a vector space model according to a feature weighting function, and extracting the top m features to represent the test text as the vector F_w = (f_1, …, f_m), then calculating the similarity between the test text and each category and taking the category with the highest similarity as the category of the test text;
and an updating module for updating the text training sample library.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the text classification method based on improved TF-IDF.
The technical scheme provided by the embodiments of the present invention has the following beneficial effects: the invention comprehensively considers the inter-class discrimination and intra-class contribution of features, and the obtained feature values can represent both the category features and the current document features. The classification results obtained by the method agree more closely with empirical values and have higher accuracy. The algorithm of the invention is fast in operation and processing, simple and convenient, and yields efficient classification results. The invention improves the efficiency and accuracy of subsequent information retrieval.
The effects and advantages demonstrated by experimental data in comparison with the prior art are as follows: it can be seen from Fig. 4 that the improved TF-IDF text classification method of the present invention improves the text classification effect about as much as the feature-weighted text classification method, the micro-average value of the feature-weighted method being 0.7998 and that of the improved TF-IDF method being 0.7982. The method of the invention scores higher than the traditional TF-IDF method and the weighting method that considers only the class factor.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as disclosed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart of a text classification method based on improved TF-IDF provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of a text classification method based on improved TF-IDF according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an improved TF-IDF text classification system according to an embodiment of the present invention.
In the figure: 1. a text training sample construction module; 2. a category abstract feature extraction module; 3. a test sample module; 4. and updating the module.
FIG. 4 is a histogram, provided by an embodiment of the present invention, of the micro-average F1 performance of three feature weighting methods and the classification method of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
As shown in fig. 1, the improved TF-IDF text classification method according to the embodiment of the present disclosure includes:
S101: constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}; performing word segmentation and stop-word removal on the text and converting it into a vector space model.
S102: extracting the category abstract features of the text training samples.
S103: the test sample is firstly subjected to word segmentation, words are removed, the test sample is converted into a vector space model according to a characteristic weighting function, the first m characteristics are extracted to represent a test text, and the vector space modelIs Fw=(f1,…,fm) And then calculating the similarity degree between the text and each type, and taking the type with the highest similarity as the test text.
S104: and updating the text training sample library.
Fig. 2 is a schematic diagram of the improved TF-IDF text classification method principle according to the embodiment of the present invention.
In a preferred embodiment of the present invention, step S102 specifically includes:
step 2.1: traversing each category in the text training sample set, and counting the document number N of each categoryi
Step 2.2: traversing each feature in the text training sample set, and counting the number n of documents containing current features in each categoryit
Step 2.3: traversing each feature in the text training sample set, and calculating the inter-class distinction degree D of each class for the current feature according to each featureitThe specific calculation process is as follows: :
if a feature item appears in a certain class with high frequency and rarely appears in other classes, the feature item has strong class distinguishing capability and should be given higher weight. Namely, the calculation process of the inter-class distinction degree is as follows:
Figure BDA0003427457910000061
above formula PitThe probability value for the average occurrence of feature t in category i,
Figure BDA0003427457910000062
is the n-th in the category iitThe word frequency of the feature t is contained in the document,
Figure BDA0003427457910000063
is the n-th in the category iitMaximum value of feature word frequency, n, in documentitAs illustrated in step 2.2;
Figure BDA0003427457910000064
above formula QtThe word frequency of the characteristic t appearing between different classes, L is the number of sample class types, nitAs illustrated in step 2.2, NiAs illustrated in step 2.1;
inter-class distinction degree DitThe formula is as follows:
Dit=Pit*Qt
Pitas above, the average occurrence probability value, Q, in the class i for the feature ttAs in the above equation, the word frequency that occurs between different classes for feature t.
Step 2.4: traversing each feature in the text training sample set and calculating, for each feature, the intra-class contribution S_it of each category to the current feature; the specific calculation process is as follows:
if a feature item appears uniformly across all documents of a certain class, the feature item is strongly representative of that class and should be given a higher weight. The intra-class contribution S_it is calculated as:
[Formula image not reproduced in the source: S_it is computed from n_idt, the word frequency of feature t in the d-th document of category i, and N_i (step 2.1).]
Step 2.5: calculating the feature value f_it under each category with an objective function and extracting the top m representative category abstract features; the specific calculation process is as follows:
f_it = D_it * S_it
where D_it is the inter-class discrimination and S_it is the intra-class contribution.
According to f_it, the m largest values are taken in descending order to obtain the category abstract feature set S_l = (s_1, …, s_m).
In a preferred embodiment of the present invention, step S103 is as follows: the test sample is first segmented into words and stop words are removed; the test sample is converted into a vector space model according to a feature weighting function, and the top m features are extracted to represent the test text, giving the test-text vector F_w = (f_1, …, f_m); the similarity between the test text and each category is then calculated, and the category with the highest similarity is taken; the specific calculation process is as follows:

sim(S_l, F_w) = (S_l · F_w) / (|S_l| · |F_w|)

The test text vector is compared in turn with each category abstract feature vector by the sim(S_l, F_w) operation; the largest sim(S_l, F_w) similarity identifies the class to which the test text belongs.
In a preferred embodiment of the present invention, step S104 updates the text training sample library.
The data set of the present invention is a corpus of news report data provided by the Reuters news agency. A corpus subset formed from four categories is adopted; the numbers of documents contained in the four categories are 1620, 1738, 2765 and 2453 respectively. The data set contains a total of 8576 documents and 26745 features.
As shown in fig. 3, the improved TF-IDF text classification system provided by the embodiment of the present invention includes:
a text training sample construction module 1 for constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}, performing word segmentation and stop-word removal on the text, and converting it into a vector space model;
and the category abstract feature extraction module 2 is used for extracting category abstract features of the text training sample.
a test sample module 3 for performing word segmentation and stop-word removal on the test sample, converting it into a vector space model F_w = (f_1, …, f_m) according to a feature weighting function and extracting the top m features to represent the test text, then calculating the similarity between the test text and each category and taking the category with the highest similarity as the category of the test text;
And the updating module 4 is used for updating the text training sample library.
The positive effects of the present invention are further described below in conjunction with specific experimental data.
As shown in Fig. 4, classifiers using the traditional TF-IDF method, the weighting method that considers only the class factor, the feature weighting method, and the text classification method of the present invention based on improved TF-IDF were run on the above data set and their micro-average F1 values compared: the x-axis represents the text classification method and the y-axis the corresponding micro-average F1 value.
Accuracy and recall can only measure the local classification performance of a classifier on a single class. When evaluating the global classification performance of a classifier, the micro-average F1 value and the macro-average are typically used.
Micro-averaging:

F1 = 2 * H * G / (H + G)

where H and G denote the micro-averaged accuracy and recall, respectively:

H = Σ_l TP_l / Σ_l (TP_l + FP_l)
G = Σ_l TP_l / Σ_l (TP_l + FN_l)

where TP_l is the number of documents correctly assigned to class l, FP_l is the number of documents assigned to class l that do not belong to it, and FN_l is the number of documents belonging to class l that were not assigned to it.
The improved TF-IDF text classification method improves the text classification effect about as much as the feature-weighted text classification method: the micro-average value of the feature-weighted method is 0.7998 and that of the improved TF-IDF method is 0.7982. The method of the invention scores higher than the traditional TF-IDF method and the weighting method that considers only the class factor.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure should be limited only by the attached claims.

Claims (10)

1. An improved TF-IDF text classification method, characterized in that the improved TF-IDF text classification method comprises:
traversing each category in the text training sample set, and counting the document number N of each categoryi
Traversing each feature in the text training sample set, and counting the number n of documents containing current features in each categoryit
Traversing each feature in the text training sample set, and calculating the inter-class distinction degree D of each class aiming at the current feature for each featureit
Traversing each feature in the text training sample set, and calculating the intra-class contribution degree S of each class to the current feature for each featureit
Calculating the feature f under each category by using an objective functionitAnd extracting the top m representative category abstract features.
2. The improved TF-IDF based text classification method according to claim 1, characterized in that traversing each feature in the text training sample set and calculating, for each feature, the inter-class discrimination D_it of each category for the current feature specifically comprises:
if a feature item appears with high frequency in a certain class, it is given a high weight; the inter-class discrimination is calculated as:

P_it = (1/n_it) * Σ_{d=1}^{n_it} (n_idt / maxtf_id)

where P_it is the average occurrence probability value of feature t in category i, n_idt is the word frequency of feature t in the d-th of the n_it documents of category i that contain t, and maxtf_id is the maximum feature word frequency in that document;

[Formula image not reproduced in the source: Q_t, the word frequency with which feature t appears among the different classes, is computed from n_it, N_i and the number of sample class types L.]
3. The improved TF-IDF based text classification method according to claim 2, characterized in that the inter-class discrimination D_it is:
D_it = P_it * Q_t
where P_it is the average occurrence probability value of feature t in category i and Q_t is the word frequency with which feature t appears among the different classes.
4. The improved TF-IDF based text classification method according to claim 1, characterized in that traversing each feature in the text training sample set and calculating the intra-class contribution S_it of each category to the current feature specifically comprises:
if a feature item appears uniformly across all documents of a certain class, it is given a higher weight; the intra-class contribution S_it is calculated as:
[Formula image not reproduced in the source: S_it is computed from n_idt, the word frequency of feature t in the d-th document of category i.]
5. The improved TF-IDF based text classification method according to claim 1, characterized in that calculating the feature value f_it under each category with an objective function and extracting the top m representative category abstract features specifically comprises:
f_it = D_it * S_it
where D_it is the inter-class discrimination and S_it is the intra-class contribution;
according to f_it, the m largest values are taken in descending order to obtain the category abstract feature set S_l = (s_1, …, s_m).
6. The improved TF-IDF based text classification method according to claim 1, characterized in that before extracting the category abstract features of the constructed text training samples, the following steps are performed:
constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}; performing word segmentation and stop-word removal on the text and converting it into a vector space model.
7. The improved TF-IDF based text classification method according to claim 1, characterized in that after extracting the category abstract features of the constructed text training samples, the method further comprises:
segmenting the test sample into words and removing stop words; converting the test sample into a vector space model F_w = (f_1, …, f_m) according to a feature weighting function and extracting the top m features to represent the test text; then calculating the similarity between the test text and each category to obtain the category of the test text;
and updating the text training sample library.
8. The improved TF-IDF based text classification method according to claim 7, characterized in that recalculating the similarity between the test text and each category, where the category with the highest similarity is the category to which the test text belongs, specifically comprises:

sim(S_l, F_w) = (S_l · F_w) / (|S_l| · |F_w|)

the test text vector is compared in turn with each category abstract feature vector by the sim(S_l, F_w) operation; the largest sim(S_l, F_w) similarity identifies the class to which the test text belongs.
9. An improved TF-IDF text classification system implementing the improved TF-IDF text classification method according to any one of claims 1 to 7, characterized in that the improved TF-IDF text classification system comprises:
a text training sample construction module for constructing a text training sample set D = {d_1, …, d_N} and a category set C = {c_1, …, c_L}, performing word segmentation and stop-word removal on the text, and converting it into a vector space model;
a category abstract feature extraction module for extracting category abstract features from the text training samples;
a test sample module for performing word segmentation and stop-word removal on the test sample, converting it into a vector space model F_w = (f_1, …, f_m) according to a feature weighting function and extracting the top m features to represent the test text, then calculating the similarity between the test text and each category to obtain the category of the test text;
and an updating module for updating the text training sample library.
10. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the improved TF-IDF text classification method according to any one of claims 1 to 7.
CN202111584594.1A 2021-12-22 2021-12-22 Text classification method, system and computer equipment based on improved TF-IDF Pending CN114282525A (en)


Publications (1)

Publication Number Publication Date
CN114282525A true CN114282525A (en) 2022-04-05

Family

ID=80874064



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination