CN112256865B - Chinese text classification method based on classifier - Google Patents

Chinese text classification method based on classifier

Info

Publication number
CN112256865B
CN112256865B (application CN202011019598.0A)
Authority
CN
China
Prior art keywords
text
category
feature
class
item
Prior art date
Legal status
Active
Application number
CN202011019598.0A
Other languages
Chinese (zh)
Other versions
CN112256865A (en)
Inventor
陈卓 (Chen Zhuo)
Current Assignee
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Qingdao University of Science and Technology filed Critical Qingdao University of Science and Technology
Priority to CN202011019598.0A priority Critical patent/CN112256865B/en
Publication of CN112256865A publication Critical patent/CN112256865A/en
Application granted granted Critical
Publication of CN112256865B publication Critical patent/CN112256865B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Chinese text classification method based on a classifier, which comprises a test set text D and a text category set C of a training set, wherein the test set text D is mapped to the text category set C of the training set through a text classification method, $D=\{d_1,d_2,\ldots,d_m\}$, $C=\{c_1,c_2,\ldots,c_n\}$, m is the number of texts, and n is the number of text categories. The method comprises 101) a text preprocessing step, 102) a classifier step, 103) a testing and evaluating step and 104) an adjusting step. The Chinese text classification method based on the classifier is more reasonable in modeling, improves classification accuracy and recall rate, and is accurate and fast as a whole.

Description

Chinese text classification method based on classifier
The invention relates to a Chinese text classification method; this application is a divisional application of the application No. 201910100095.7.
Technical Field
The invention relates to the field of text classification, in particular to a Chinese text classification method based on a classifier.
Background
In recent years, chemical accidents have occurred frequently. For example, the "8.12" fire and explosion at the Ruihai International Logistics dangerous-goods warehouse in the Tianjin Binhai New Area and the "11.22" oil pipeline leak and explosion in Huangdao, Qingdao, Shandong brought not only huge economic losses but also casualties and environmental pollution; some serious chemical accidents easily cause public panic and have a great influence on society. If reports related to such chemical accidents could be located quickly and accurately by some technology, it would facilitate investigating the causes of the accidents, tracking the reports, preventing further accidents, and so on. A technology for efficiently managing such information is therefore required, one that automatically classifies a large amount of text information and selects the texts of the specific field that people want. Text classification technology can analyze and process large amounts of text data, greatly reducing manual intervention; it can locate specific information texts efficiently and accurately and is an effective way of processing all kinds of texts.
Information technology is developing ever more rapidly, internet technology is maturing, and the amount of data generated is growing explosively; most of it is semi-structured or unstructured and presented in text form. If texts were classified manually, the classification results would be accurate, but the labor and material cost would be enormous; such an approach cannot keep pace with the extremely rapid growth of information in the internet era or the demands of social development, and is very difficult to realize. In fact, according to their specific needs, people often only care about text information in a certain field, and the rapid extraction of specified text information plays a significant role in the development of internet technology.
The earliest reported work on text classification in China dates to the early 1980s, when it was first systematically described by Professor Hou Hanqing in Nanjing. Subsequently many scholars continuously improved text classification methods, and research in this field made great progress in China. Li Xiaoli, Shi Zhongzhi and others greatly improved the accuracy and recall rate of text classification by introducing concept inference networks into it. Jiang Yuan, Zhou Zhihua and others proposed in 2006 that word frequency be used as an influence factor during classification; Li Rongliu of Fudan University adopted a classification method based on the maximum entropy model when constructing a text classifier; and Huang Jingjing and others broadly extended text classification with language-independent techniques. However, no method so far achieves extremely high classification accuracy overall. How to locate information quickly and accurately has been an important research topic in recent years.
Disclosure of Invention
The Chinese text classification method based on the classifier is more reasonable in modeling, higher in classification accuracy and recall rate, and is accurate and rapid as a whole.
The technical scheme of the invention is as follows:
A Chinese text classification method based on a classifier comprises a test set text D and a text category set C of a training set, wherein the test set text D is mapped to the text category set C of the training set through a text classification method, $D=\{d_1,d_2,\ldots,d_m\}$, $C=\{c_1,c_2,\ldots,c_n\}$, m is the number of texts, and n is the number of text categories. The specific processing steps are as follows:
101 Text preprocessing step: performing text marking processing, word segmentation and stop word removal on the text of the training set, performing feature selection on the processed text through statistics, and performing feature dimension reduction to obtain a text category set C of the training set;
wherein the statistics ranks feature items by the correlation between a feature item t and a category $C_i$, using four counts: A, the number of texts that belong to category $C_i$ and contain the feature item t; B, the number of texts that do not belong to $C_i$ but contain t; C, the number of texts that belong to $C_i$ but do not contain t; and D, the number of texts that neither belong to $C_i$ nor contain t. $C_i$ represents one category in the text category set obtained after word segmentation with duplicate segments removed, where i is a category index no greater than the number of segmented words; the feature item t is a specific segmented word;
the total number set of texts containing the feature item t in the training set is A + B, the total number set of texts containing no feature item t is C + D, and the category C i The text number set of (2) is A + C, the text number set of other categories is B + D, the total text number set of the training set is N, and N = A + B + C + D, the probability of the feature item t is represented as
Figure BDA0002700186230000031
From this, the relevance value of the feature item t and the class $C_i$ can be obtained:

$$X^2(t,c_i)=\frac{N\,(AD-CB)^2}{(A+B)(C+D)(A+C)(B+D)}$$
if the feature item t and the category $C_i$ are independent of each other, then AD-CB = 0 and $X^2(t,c_i)=0$; the larger the value of $X^2(t,c_i)$, the greater the degree of correlation between the feature item t and the class $C_i$. AD represents a quantized value of correctly judging from the feature item t that a document belongs to class $C_i$, and CB represents a quantized value of erroneously judging from t that a document belongs to class $C_i$;
the average value of the statistical ranking is used as comparison, and the average value is the following formula:
Figure BDA0002700186230000033
the statistical sorting is performed by sorting the average values from large to small, and a certain number of characteristic items are selected from the text category set C of the training set from large to small;
102) Classifier step: processing the data processed in step 101) by a text classifier, wherein the specific formula is as follows:

$$P(C_i \mid D_j)=\frac{P(C_i)\,P(x_1,x_2,\ldots,x_n \mid C_i)}{P(x_1,x_2,\ldots,x_n)} \qquad (3)$$
wherein $P(C_i \mid D_j)$ represents the probability that a text $D_j$ belongs to a certain class $C_i$, and the document $D_j$ may be represented by its set of segmented words, i.e. $D_j=\{x_1,x_2,\ldots,x_n\}$. Since the number of occurrences of a fixed feature word in the text set is constant, the denominator $P(x_1,x_2,\ldots,x_n)$ in formula (3) is constant, so only the numerator $P(C_i)P(x_1,x_2,\ldots,x_n \mid C_i)$ of formula (3) is needed to determine the relative magnitudes of the $P(C_i \mid D_j)$ values; therefore, formula (3) can finally be expressed as:
$$P(C_i \mid D_j) \propto P(C_i)\prod_{j=1}^{n}P(x_j \mid C_i) \qquad (4)$$
wherein $x_j$ is a feature item of the document $D_j$ and n is the number of feature items. When a feature item appears in the text, its weight is set to 1; if it does not appear, the weight is set to 0. The test text is regarded as an n-fold Bernoulli experiment, i.e. a random experiment repeated independently under the same conditions. Letting $B_{xt}$ indicate whether the test document contains the text feature item t, the following formula is obtained:

$$P(D_j \mid C_i)=\prod_{j=1}^{n}\Bigl[B_{xt}\,P(x_j \mid C_i)+(1-B_{xt})\bigl(1-P(x_j \mid C_i)\bigr)\Bigr]$$
where, for the category $C_i$, $P(x_j \mid C_i)$ is the probability that $x_j$ appears; if a feature item appears in the test text, only $P(x_j \mid C_i)$ needs to be obtained, otherwise $1-P(x_j \mid C_i)$ is obtained;
the conditional probability $P(x_j \mid C_i)$ is given by:

$$P(x_j \mid C_i)=\frac{n_{ij}}{N_i}$$

where $n_{ij}$ is the number of texts of class $C_i$ containing the feature item $x_j$ and $N_i$ is the total number of texts of class $C_i$;
in the training set, if class C i All texts in (1) do not contain the feature item x j Then n is ij Is 0, whereby P (x) j |C i ) The value of (2) is 0, so that a method of adding a smoothing factor needs to be adopted, and the following formula is obtained:
Figure BDA0002700186230000044
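The classifier of the formulas above can be sketched as a minimal Bernoulli naive Bayes implementation, assuming the +1/+2 Laplace smoothing reconstructed above; it works in log space to avoid numerical underflow, an implementation choice the patent does not specify.

```python
import math

def train(docs, labels, features):
    """Estimates P(C_i) = N_i / N and the smoothed
    P(x_j | C_i) = (n_ij + 1) / (N_i + 2) for every feature word."""
    N = len(docs)
    prior, cond = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        Ni = len(class_docs)
        prior[c] = Ni / N
        cond[c] = {t: (sum(1 for d in class_docs if t in d) + 1) / (Ni + 2)
                   for t in features}
    return prior, cond

def classify(doc, prior, cond, features):
    """argmax over classes of
    P(C_i) * prod_j [B_xt P(x_j|C_i) + (1 - B_xt)(1 - P(x_j|C_i))]."""
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for t in features:
            p = cond[c][t]
            score += math.log(p if t in doc else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best
```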
103) Testing and evaluating step: evaluating the accuracy, the recall rate, the F1 value and the macro-average of the classifier, and adjusting the text category set C of the training set.
Further, the text marking process removes the Chinese symbols, numbers and English in the text using regular expressions. The regular expression removing Chinese symbols can be expressed as [^\u4e00-\u9fa5\w], and the regular expression removing numbers and English is [a-zA-Z\d]; matches are replaced with a blank space.
Further, adopting an MMSEG4J word segmentation toolkit to perform word segmentation; the stop words are words which appear in the text for many times and are irrelevant to the text content, are sorted into a stop word list and are deleted after the word segmentation is finished.
Further, the accuracy, also called precision, measures how many texts in the test set are correctly classified; it reflects the exactness of the classifier's classification and is denoted P:

$$P=\frac{A}{A+B}$$
belong to class C i And the text number set A containing the feature item t, i.e. correctly classified to C i The number of texts of the class; not belonging to class C i But the text number set B containing the characteristic item t, A + B is actually classified into C i Total number of texts of class;
recall, also known as recall, and acquisition tests focused on category C i Can be correctly classified into the category C i The occupied proportion shows the completeness of the classification of the classifier, which is marked as R, and the specific formula is as follows:
Figure BDA0002700186230000051
belong to class C i But the text number sets C, A + C containing no feature item t, i.e. all should be classified as C i The text of the class;
the F1 value, also called the comprehensive classification rate, is a comprehensive evaluation index of the accuracy P and the recall rate R, and the specific formula is as follows:
Figure BDA0002700186230000052
the macro-average is the evaluation of the overall classification effect of the classifier, the arithmetic mean of the accuracy and the recall rate is the macro-average, and the specific formula is as follows:
Figure BDA0002700186230000053
Figure BDA0002700186230000054
wherein MacAvg_Precision represents the macro-average of precision, MacAvg_Recall represents the macro-average of recall, |C| represents the number of text categories contained in the training set, $P_i$ represents the precision of class $C_i$, and $R_i$ represents the recall of class $C_i$.
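As an illustration of this evaluation, the four indexes can be computed from true and predicted labels as follows; the data representation is an assumption made for the example.

```python
def evaluate(y_true, y_pred, classes):
    """Per-class precision P = A/(A+B), recall R = A/(A+C),
    F1 = 2PR/(P+R), and the macro-averages over |C| categories."""
    P, R = {}, {}
    for c in classes:
        A = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        B = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        C = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        P[c] = A / (A + B) if A + B else 0.0
        R[c] = A / (A + C) if A + C else 0.0
    f1 = {c: 2 * P[c] * R[c] / (P[c] + R[c]) if P[c] + R[c] else 0.0
          for c in classes}
    mac_p = sum(P.values()) / len(classes)   # MacAvg_Precision
    mac_r = sum(R.values()) / len(classes)   # MacAvg_Recall
    return P, R, f1, mac_p, mac_r
```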
Compared with the prior art, the invention has the advantages that: the method establishes the characteristic items through the text training set, and evaluates the characteristic items through indexes such as accuracy, recall rate, F1 value, macro-average and the like, thereby training and adjusting the selected characteristic items. According to the method, the relationship degree quantization values are obtained and sorted through the relevance values of the feature items and the categories, and the appropriate feature items are selected as the classification standards, so that the accuracy, the recall rate and the precision are improved. The scheme of the invention provides possibility for high efficiency of text classification, and has high classification accuracy, high recall rate and accurate and rapid whole.
Drawings
FIG. 1 is a diagram of an overall model of the present invention;
FIG. 2 is a diagram of a text classification mapping model according to the present invention;
FIG. 3 is the original text in the training set of the present invention;
FIG. 4 is the text of FIG. 3 after text markup processing in accordance with the present invention;
FIG. 5 is a diagram of the present invention after the word segmentation process of FIG. 4;
FIG. 6 is the text of FIG. 5 with stop word processing removed in accordance with the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
As shown in FIG. 1 to FIG. 6, a Chinese text classification method based on a classifier comprises a test set text D and a training set text category set C, and maps the test set text D to the training set text category set C by a text classification method, wherein $D=\{d_1,d_2,\ldots,d_m\}$, $C=\{c_1,c_2,\ldots,c_n\}$, m is the number of texts, and n is the number of text categories. The method specifically comprises the following steps:
101) Text preprocessing step: text marking processing, word segmentation and stop word removal are carried out on the texts of the training set; feature selection is then performed on the processed texts through statistics, and feature dimensionality reduction is performed to obtain the text category set C of the training set. The specific steps are as follows:
As shown in FIG. 3, the original texts in the training set contain special characters, numbers, etc. that carry no text information; they do not help the classification of the text and, being noise data, require text marking processing. Regular expressions are used to remove Chinese symbols, numbers and English: the regular expression removing Chinese symbols can be expressed as [^\u4e00-\u9fa5\w], and the regular expression removing numbers and English is [a-zA-Z\d]. The processed text is shown in FIG. 4. To avoid affecting the subsequent Chinese word segmentation, the removed symbols are replaced with blank spaces.
Apart from punctuation, Chinese text has no obvious separator marks, so the MMSEG4J word segmentation toolkit is adopted to divide the Chinese text into words, which is a key step in processing Chinese text information. The resulting word-segmented text is shown in FIG. 5.
Words that appear many times in a text but are irrelevant to its content are called stop words, such as the function words '啊' ('oh') and '但是' ('but'), real words without practical meaning, conjunctions, modal particles, prepositions, pronouns and the like. These words appear in almost every text; they are collected into a stop word list and deleted after the Chinese word segmentation is finished, and the resulting text is the text information after preprocessing. The stop word list may be taken directly from the Baidu stop word list. The text after stop word removal is shown in FIG. 6. A sketch of the whole preprocessing pipeline follows.
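A minimal Python sketch of this preprocessing pipeline, with jieba used only as a stand-in for the Java MMSEG4J toolkit named above and a hypothetical stop word file name:

```python
import re
import jieba  # stand-in segmenter; the patent uses the Java MMSEG4J toolkit

# Load a stop word list (e.g. the Baidu stop word list); the path is hypothetical.
with open("baidu_stopwords.txt", encoding="utf-8") as f:
    STOP_WORDS = {line.strip() for line in f if line.strip()}

def preprocess(text):
    # Text marking: replace everything except Chinese characters / word
    # characters, then letters and digits, with spaces (regexes as
    # reconstructed from the description above).
    text = re.sub(r"[^\u4e00-\u9fa5\w]", " ", text)
    text = re.sub(r"[a-zA-Z\d]", " ", text)
    # Word segmentation followed by stop word removal.
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]
```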
Wherein the statistics ranks feature items by the correlation between a feature item t and a category $C_i$, using four counts: A, the number of texts that belong to category $C_i$ and contain the feature item t; B, the number of texts that do not belong to $C_i$ but contain t; C, the number of texts that belong to $C_i$ but do not contain t; and D, the number of texts that neither belong to $C_i$ nor contain t. $C_i$ represents one category in the text category set obtained after word segmentation with duplicate segments removed, where i is a category index no greater than the number of segmented words; the feature item t is a specific segmented word.
in the training set text total number N =806, a + b =394, in the chemical accident news report category, a =383, b =11, c =108, d =304, p (chemical industry) =0.609; in the non-chemical engineering accident news report category, a =11, b =383, c =304, d =108, and p (non-chemical engineering) =0.391 are cases.
The total number of texts containing the feature item t in the training set is A+B, the total number of texts not containing t is C+D, the number of texts of category $C_i$ is A+C, the number of texts of the other categories is B+D, and the total number of training texts is N = A+B+C+D; the probability of the feature item t is expressed as

$$P(t)=\frac{A+B}{N}$$
From this, the relevance value of the feature item t and the class $C_i$ can be obtained:

$$X^2(t,c_i)=\frac{N\,(AD-CB)^2}{(A+B)(C+D)(A+C)(B+D)}$$
If the feature item t and the category $C_i$ are independent of each other, then AD-CB = 0 and $X^2(t,c_i)=0$; the larger the value of $X^2(t,c_i)$, the greater the degree of correlation between the feature item t and the class $C_i$. AD represents a quantized value of correctly judging from the feature item t that a document belongs to class $C_i$, and CB represents a quantized value of erroneously judging from t that a document belongs to class $C_i$.
the average value of the statistical ranking is used as comparison, and the average value is the following formula:
Figure BDA0002700186230000081
statistical ranking the mean values were ranked from big to small, from trainingAnd selecting a certain number of characteristic items from large to small in the exercise set text category set C. The result of each characteristic item t to be obtained
Figure BDA0002700186230000082
And arranging the feature words from large to small according to a selection sorting algorithm, and if the number of the feature words to be selected is 50, only selecting the first 50 feature words arranged from large to small. It may happen that the 50 th and 51 th calculated results are the same, and at this time, the results need to be evaluated and adjusted accordingly, and even if the results are different, the final evaluation may occur later than the first 50 ranked feature words. And therefore needs to be adjusted according to the evaluation. The method comprises the following specific steps:
$$X^2(\text{消防},c_{\text{chemical}})=\frac{806\times(383\times304-108\times11)^2}{394\times412\times491\times315}\approx 426.37 \qquad (2\text{-}10)$$

$$X^2(\text{消防},c_{\text{non-chemical}})=\frac{806\times(11\times108-304\times383)^2}{394\times412\times315\times491}\approx 426.37 \qquad (2\text{-}11)$$

$$\overline{X^2}(\text{消防})=0.609\times426.37+0.391\times426.37\approx 426.37 \qquad (2\text{-}12)$$

As can be seen from equations (2-10) to (2-12), the averaged statistic $\overline{X^2}$ of '消防' (fire fighting) is 426.37. The other keywords are treated in the same way, so the data can be obtained and sorted, and the required number of feature words is selected as the feature items of the text category set of the training set.
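The 426.37 value can be checked directly from the counts given above (N = 806, A = 383, B = 11, C = 108, D = 304); a quick numerical check in Python:

```python
A, B, C, D = 383, 11, 108, 304
N = A + B + C + D                          # 806
chi2 = N * (A * D - C * B) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))
# With two categories the statistic is the same for both classes, so the
# prior-weighted average equals chi2 itself:
avg = 0.609 * chi2 + 0.391 * chi2
print(round(chi2, 2), round(avg, 2))       # 426.37 426.37
```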
102) Classifier step: the data processed in step 101) is processed by the text classifier. Taking as an example a news report text processed by the above step with 300 feature words selected: 128 words are obtained after the text is preprocessed, and 37 feature words remain in the article after the statistical processing, which greatly reduces the processing load and improves the processing accuracy. The specific formula is as follows:

$$P(C_i \mid D_j)=\frac{P(C_i)\,P(x_1,x_2,\ldots,x_n \mid C_i)}{P(x_1,x_2,\ldots,x_n)} \qquad (3)$$
wherein $P(C_i \mid D_j)$ represents the probability that a text $D_j$ belongs to a certain class $C_i$, and the document $D_j$ may be represented by its set of segmented words, i.e. $D_j=\{x_1,x_2,\ldots,x_n\}$. Since the number of occurrences of a fixed feature word in the text set is constant, the denominator $P(x_1,x_2,\ldots,x_n)$ in formula (3) is constant, so only the numerator $P(C_i)P(x_1,x_2,\ldots,x_n \mid C_i)$ of formula (3) is needed to determine the relative magnitudes of the $P(C_i \mid D_j)$ values.
Therefore, formula (3) can finally be expressed as:
$$P(C_i \mid D_j) \propto P(C_i)\prod_{j=1}^{n}P(x_j \mid C_i) \qquad (4)$$
wherein $x_j$ is a feature item of the document $D_j$ and n is the number of feature items. When a feature item appears in the text, its weight is set to 1; if it does not appear, the weight is set to 0. The test text is regarded as an n-fold Bernoulli experiment, i.e. a random experiment repeated independently under the same conditions.
Taking a case as an example: $P(C_i)$ is the prior, and

$$\prod_{j=1}^{n}P(x_j \mid C_i)$$

is the product of the conditional probabilities of all feature items in class $C_i$. The values C(chemical) and C(non-chemical) are calculated and compared; if C(chemical) > C(non-chemical), the test news report text belongs to the chemical accident news report category; otherwise it belongs to the non-chemical accident news category.
The prior probability of class $C_i$ can be expressed as:

$$P(C_i)=\frac{N_i}{N}$$

where $N_i$ is the number of training texts of class $C_i$ and N is the total number of training texts.
Letting $B_{xt}$ indicate whether the test document contains the text feature item t, the following formula is obtained:

$$P(D_j \mid C_i)=\prod_{j=1}^{n}\Bigl[B_{xt}\,P(x_j \mid C_i)+(1-B_{xt})\bigl(1-P(x_j \mid C_i)\bigr)\Bigr]$$
where, for the category $C_i$, $P(x_j \mid C_i)$ is the probability that $x_j$ appears; if a feature item appears in the test text, only $P(x_j \mid C_i)$ needs to be obtained, otherwise $1-P(x_j \mid C_i)$ is obtained;
the conditional probability $P(x_j \mid C_i)$ is given by:

$$P(x_j \mid C_i)=\frac{n_{ij}}{N_i}$$
in the training set, if class C i All texts in (1) do not contain the feature item x j Then n is ij Is 0, whereby P (x) j |C i ) The value of (2) is 0, so that a method of adding a smoothing factor needs to be adopted, and the following formula is obtained:
Figure BDA0002700186230000097
in the training set text total number N =806, a + b =394, in the chemical accident news report category, a =383, b =11, c =108, d =304, p (chemical industry) =0.609; in the category of non-chemical accident news reports, a =11, b =383, c =304, d =108, p (non-chemical) =0.391 is an example. Taking 806 training set texts, 491 chemical accident news reports and 315 non-chemical accident news reports as examples, in the category of the chemical accident news reports, P (chemical engineering) =491/806=0.609; in the category of chemical accident news reports,p (non-chemical) =315/806=0.391. Taking the news report of fig. 3 as an example, the words after text processing are shown in fig. 5, t i For all the words in figure 5 of the drawings,
Figure BDA0002700186230000101
the test news report text belongs to a chemical accident news report.
103) Testing and evaluating step: the classification performance of the text classifier is tested with the test set texts, and the accuracy, recall rate, comprehensive classification rate and macro-average of the text classifier are evaluated so as to improve the classification performance.
The accuracy, also called precision, measures how many texts in the test set are correctly classified; it reflects the exactness of the classifier's classification and is denoted P:

$$P=\frac{A}{A+B}$$
belong to class C i And the text number set A containing the feature item t, i.e. correctly classified to C i The number of texts of the class; not belonging to class C i But the text number set B containing the characteristic item t, A + B is actually classified into C i Total number of texts of class;
recall, also known as recall, and acquisition tests focused on category C i Can be correctly classified into the category C i The occupied proportion shows the completeness of the classification of the classifier, which is marked as R, and the specific formula is as follows:
Figure BDA0002700186230000103
belong to class C i But the text number sets C, A + C containing no feature item t, i.e. all should be classified as C i The text of the class;
the F1 value, also called the comprehensive classification rate, is a comprehensive evaluation index of the accuracy P and the recall rate R, and the specific formula is as follows:
Figure BDA0002700186230000104
the macro-average is the evaluation of the overall classification effect of the classifier, the arithmetic mean of the accuracy and the recall rate is the macro-average, and the specific formula is as follows:
Figure BDA0002700186230000111
Figure BDA0002700186230000112
wherein MacAvg_Precision represents the macro-average of precision, MacAvg_Recall represents the macro-average of recall, |C| represents the number of text categories contained in the training set, $P_i$ represents the precision of class $C_i$, and $R_i$ represents the recall of class $C_i$.
Taking the chemical training set texts as an example, experimental data comparing the common information gain method with the statistical method are as follows:

                                               Correctly classified   Misclassified   Accuracy
  Using this statistical word selection method          196                 9           95.5%
  Without any word selection method                     134                66           67%

Table 1: Classification accuracy with and without the statistical word selection method
[Table 2: chemical accident category test; the table data appear only as an image in the source]

[Table 3: non-chemical accident category test; the table data appear only as an image in the source]
As can be seen from the tables above, the classification accuracy using the statistical method is significantly higher than without it. For the chemical accident category, the number of selected feature words has almost no influence on classification accuracy for either the statistical method or the information gain feature selection method; the statistical method achieves the higher accuracy, reaching more than 98%, while the information gain method is slightly lower. For the non-chemical accident category, classification accuracy is high when the number of feature words is 300, 500 or 1000: the statistical method reaches more than 89%, while the information gain method, although reaching more than 70%, is strongly affected by the number of feature words, its accuracy improving as more feature words are selected.
Inspection of the training set texts shows that most texts of the chemical accident category involve leakage, fire, explosion, poisoning and the like, so the classification accuracy of the chemical accident category is high, whereas the texts of the non-chemical accident category include news in the IT, military, education, sports, finance and other fields, covering a wide range. Most of the misclassified texts in the non-chemical accident test set concern fire drills, chemical accident summaries and the like; their features resemble those of chemical accidents, so they are classified into the chemical accident category.
104) Adjusting step: the selected feature items are adjusted according to the evaluation results of step 103), and the test evaluation is carried out again until the best effect is achieved. The comparison data in the tables above are the processing results for the unadjusted feature words; the results after adjustment are higher.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and decorations can be made without departing from the spirit of the present invention, and these modifications and decorations should also be regarded as being within the scope of the present invention.

Claims (4)

1. A Chinese text classification method based on a classifier, characterized by comprising a test set text D and a text category set C of a training set, wherein the test set text D is mapped to the text category set C of the training set through a text classification method, $D=\{d_1,d_2,\ldots,d_m\}$, $C=\{c_1,c_2,\ldots,c_n\}$, m is the number of texts, and n is the number of text categories; the specific processing steps are as follows:
101) Text preprocessing step: performing text marking processing, word segmentation and stop word removal on the text of the training set, performing feature selection on the processed text through statistics, and performing feature dimension reduction to obtain a text category set C of the training set;
wherein the statistics ranks feature items by the correlation between a feature item t and a category $C_i$, using four counts: A, the number of texts that belong to category $C_i$ and contain the feature item t; B, the number of texts that do not belong to $C_i$ but contain t; C, the number of texts that belong to $C_i$ but do not contain t; and D, the number of texts that neither belong to $C_i$ nor contain t; $C_i$ represents one category in the text category set obtained after word segmentation with duplicate segments removed, where i is a category index no greater than the number of segmented words; the feature item t is a specific segmented word;
the total number set of texts containing the feature item t in the training set is A + B, the total number set of texts containing no feature item t is C + D, and the category C i The text number set of (2) is A + C, the text number set of other categories is B + D, the total text number set of the training set is N, and N = A + B + C + D, the probability of the feature item t is represented as
Figure FDA0004043420920000011
from this, the relevance value of the feature item t and the category $C_i$ can be obtained:

$$X^2(t,c_i)=\frac{N\,(AD-CB)^2}{(A+B)(C+D)(A+C)(B+D)}$$
if the feature item t and the category $C_i$ are independent of each other, then AD-CB = 0 and $X^2(t,c_i)=0$; the larger the value of $X^2(t,c_i)$, the greater the degree of correlation between the feature item t and the class $C_i$; AD represents a quantized value of correctly judging from the feature item t that a document belongs to class $C_i$, and CB represents a quantized value of erroneously judging from t that a document belongs to class $C_i$;
the average value of the statistical ranking is used as comparison, and the average value is the following formula:
Figure FDA0004043420920000013
the statistical ranking is ranked from large to small according to the average value, and a certain number of characteristic items are selected from the text category set C of the training set from large to small according to the statistical ranking; the result of each characteristic item t to be obtained
Figure FDA0004043420920000021
and the feature words are arranged from large to small according to a selection sorting algorithm; if the result calculated for the n-th selected feature word is the same as that for the (n+1)-th, the results need to be evaluated and correspondingly adjusted or exchanged; even when the results differ, a feature word ranked later may evaluate higher in the final evaluation than the first n feature words; the selection therefore needs to be adjusted according to the evaluation, as follows:
$$X^2(\text{消防},c_{\text{chemical}})=\frac{806\times(383\times304-108\times11)^2}{394\times412\times491\times315}\approx 426.37 \qquad (2\text{-}10)$$

$$X^2(\text{消防},c_{\text{non-chemical}})=\frac{806\times(11\times108-304\times383)^2}{394\times412\times315\times491}\approx 426.37 \qquad (2\text{-}11)$$

$$\overline{X^2}(\text{消防})=0.609\times426.37+0.391\times426.37\approx 426.37 \qquad (2\text{-}12)$$

as can be seen from equations (2-10) to (2-12), the averaged statistic $\overline{X^2}$ of '消防' (fire fighting) is 426.37; the other keywords are treated in the same way, so the data can be obtained and sorted, and the required number of feature words is selected as the feature items of the text category set of the training set;
102) Classifier step: processing the data processed in step 101) by a text classifier, wherein the specific formula is as follows:

$$P(C_i \mid D_j)=\frac{P(C_i)\,P(x_1,x_2,\ldots,x_n \mid C_i)}{P(x_1,x_2,\ldots,x_n)} \qquad (3)$$
the denominator P (x) in the formula (3) 1 ,x 2 ,…,x n ) Is constant, so that it is only necessary to obtain the molecule P (C) in the formula (3) i )P(x 1 ,x 2 ,…,x n |C i ) Can determine that when different j values are obtained, P (C) is different i |D j ) Magnitude relationship between values; therefore, the formula (3) can be finally expressed as:
$$P(C_i \mid D_j) \propto P(C_i)\prod_{j=1}^{n}P(x_j \mid C_i) \qquad (4)$$
when a feature item appears in the text, its weight is set to 1; if it does not appear, the weight is set to 0; the test text is regarded as an n-fold Bernoulli experiment, i.e. a random experiment repeated independently under the same conditions; letting $B_{xt}$ indicate whether the test document contains the text feature item t, the following formula is obtained:

$$P(D_j \mid C_i)=\prod_{j=1}^{n}\Bigl[B_{xt}\,P(x_j \mid C_i)+(1-B_{xt})\bigl(1-P(x_j \mid C_i)\bigr)\Bigr]$$
where, for the category $C_i$, $P(x_j \mid C_i)$ is the probability that $x_j$ appears; if a feature item appears in the test text, only $P(x_j \mid C_i)$ needs to be obtained, otherwise $1-P(x_j \mid C_i)$ is obtained;
the conditional probability $P(x_j \mid C_i)$ is given by:

$$P(x_j \mid C_i)=\frac{n_{ij}}{N_i}$$
in the training set, if class C i All texts in (1) do not contain the feature item x j Then n is ij Is 0, whereby P (x) j |C i ) Is 0, so the method of adding the smoothing factor needs to be adopted, and the following formula is obtained:
Figure FDA0004043420920000033
103) Testing and evaluating step: evaluating the accuracy, recall rate, F1 value and macro-average of the classifier, and adjusting the text category set C of the training set;
104) Adjusting step: adjusting the selected feature items according to the evaluation results of step 103), and carrying out the test evaluation again until a preset effect is achieved.
2. The Chinese text classification method based on a classifier as claimed in claim 1, wherein: the text marking process removes the Chinese symbols, numbers and English in the text using regular expressions; the regular expression removing Chinese symbols can be expressed as [^\u4e00-\u9fa5\w], and the regular expression removing numbers and English is [a-zA-Z\d]; matches are replaced with a blank space.
3. The Chinese text classification method based on classifier as claimed in claim 1, wherein: performing word segmentation by adopting an MMSEG4J word segmentation toolkit; the stop words are words which appear in the text for many times and are irrelevant to the text content, are sorted into a stop word list and are deleted after the word segmentation is finished.
4. The Chinese text classification method based on a classifier as claimed in claim 1, wherein:
the accuracy, also called precision, measures how many texts in the test set are correctly classified; it reflects the exactness of the classifier's classification and is denoted P:

$$P=\frac{A}{A+B}$$
belong to class C i And the text number set A containing the feature item t, namely correctly classified into C i The number of texts of the class; not belonging to class C i But the text number set B containing the characteristic item t, A + B is actually classified into C i Total number of texts of class;
recall, also known as recall, and acquisition tests focused on category C i Can be correctly classified into the category C i The occupied proportion shows the completeness of the classification of the classifier, which is marked as R, and the specific formula is as follows:
Figure FDA0004043420920000042
belong to class C i But the text number sets C, A + C containing no feature item t, i.e. all should be classified as C i The text of the class;
the F1 value, also called the comprehensive classification rate, is a comprehensive evaluation index of the accuracy P and the recall rate R, and the specific formula is as follows:
Figure FDA0004043420920000043
the macro-average is the evaluation of the overall classification effect of the classifier, the arithmetic mean of the accuracy and the recall rate is the macro-average, and the specific formula is as follows:
Figure FDA0004043420920000044
Figure FDA0004043420920000045
wherein MacAvg_Precision represents the macro-average of precision, MacAvg_Recall represents the macro-average of recall, |C| represents the number of text categories contained in the training set, $P_i$ represents the precision of class $C_i$, and $R_i$ represents the recall of class $C_i$.
CN202011019598.0A 2019-01-31 2019-01-31 Chinese text classification method based on classifier Active CN112256865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011019598.0A CN112256865B (en) 2019-01-31 2019-01-31 Chinese text classification method based on classifier

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910100095.7A CN109902173B (en) 2019-01-31 2019-01-31 Chinese text classification method
CN202011019598.0A CN112256865B (en) 2019-01-31 2019-01-31 Chinese text classification method based on classifier

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910100095.7A Division CN109902173B (en) 2019-01-31 2019-01-31 Chinese text classification method

Publications (2)

Publication Number Publication Date
CN112256865A CN112256865A (en) 2021-01-22
CN112256865B true CN112256865B (en) 2023-03-21

Family

ID=66944611

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202011019598.0A Active CN112256865B (en) 2019-01-31 2019-01-31 Chinese text classification method based on classifier
CN201910100095.7A Active CN109902173B (en) 2019-01-31 2019-01-31 Chinese text classification method

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910100095.7A Active CN109902173B (en) 2019-01-31 2019-01-31 Chinese text classification method

Country Status (1)

Country Link
CN (2) CN112256865B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798853A (en) * 2020-03-27 2020-10-20 北京京东尚科信息技术有限公司 Method, device, equipment and computer readable medium for speech recognition
CN112084308A (en) * 2020-09-16 2020-12-15 中国信息通信研究院 Method, system and storage medium for text type data recognition
CN112215002A (en) * 2020-11-02 2021-01-12 浙江大学 Electric power system text data classification method based on improved naive Bayes

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN108509471A (en) * 2017-05-19 2018-09-07 苏州纯青智能科技有限公司 A kind of Chinese Text Categorization
CN109165294A (en) * 2018-08-21 2019-01-08 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4713870B2 * 2004-10-13 2011-06-29 Hewlett-Packard Development Company, L.P. Document classification apparatus, method, and program
US8346534B2 (en) * 2008-11-06 2013-01-01 University of North Texas System Method, system and apparatus for automatic keyword extraction
CN101819601B (en) * 2010-05-11 2012-02-08 同方知网(北京)技术有限公司 Method for automatically classifying academic documents
CN104063399B (en) * 2013-03-22 2017-03-22 杭州娄文信息科技有限公司 Method and system for automatically identifying emotional probability borne by texts
CN105183831A (en) * 2015-08-31 2015-12-23 上海德唐数据科技有限公司 Text classification method for different subject topics

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN108509471A (en) * 2017-05-19 2018-09-07 苏州纯青智能科技有限公司 A kind of Chinese Text Categorization
CN109165294A (en) * 2018-08-21 2019-01-08 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"中文文本分类特征选择方法的研究与实现";林艳峰;《中国优秀硕士学位论文全文数据库 信息科技辑》;20160315;第I138-7803页 *
"基于朴素贝叶斯方法的中文文本分类研究";李丹;《中国优秀硕士学位论文全文数据库 信息科技辑》;20111115;第I138-519页 *

Also Published As

Publication number Publication date
CN109902173A (en) 2019-06-18
CN109902173B (en) 2020-10-27
CN112256865A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN107992633B (en) Automatic electronic document classification method and system based on keyword features
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN103514183B (en) Information search method and system based on interactive document clustering
US8380714B2 (en) Method, computer system, and computer program for searching document data using search keyword
CN102411563B (en) Method, device and system for identifying target words
CN100401302C (en) Image meaning automatic marking method based on marking significance sequence
CN112256865B (en) Chinese text classification method based on classifier
CN107291723A (en) The method and apparatus of web page text classification, the method and apparatus of web page text identification
CN108197175B (en) Processing method and device of technical supervision data, storage medium and processor
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN110472203B (en) Article duplicate checking and detecting method, device, equipment and storage medium
CN101887415B (en) Automatic extraction method for text document theme word meaning
CN110287493B (en) Risk phrase identification method and device, electronic equipment and storage medium
CN112579783B (en) Short text clustering method based on Laplace atlas
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
CN116881799A (en) Method for classifying cigarette production data
CN110888977B (en) Text classification method, apparatus, computer device and storage medium
CN114511027B (en) Method for extracting English remote data through big data network
CN115712720A (en) Rainfall dynamic early warning method based on knowledge graph
CN114547233A (en) Data duplicate checking method and device and electronic equipment
CN110413782B (en) Automatic table theme classification method and device, computer equipment and storage medium
CN113326688A (en) Ideological and political theory word duplication checking processing method and device
CN116414939B (en) Article generation method based on multidimensional data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant