CN112256865B - Chinese text classification method based on classifier - Google Patents

Chinese text classification method based on classifier

Info

Publication number
CN112256865B
CN112256865B (application CN202011019598.0A)
Authority
CN
China
Prior art keywords
text
category
feature
class
item
Prior art date
Legal status
Active
Application number
CN202011019598.0A
Other languages
Chinese (zh)
Other versions
CN112256865A (en)
Inventor
陈卓 (Chen Zhuo)
Current Assignee
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Qingdao University of Science and Technology filed Critical Qingdao University of Science and Technology
Priority to CN202011019598.0A priority Critical patent/CN112256865B/en
Publication of CN112256865A publication Critical patent/CN112256865A/en
Application granted granted Critical
Publication of CN112256865B publication Critical patent/CN112256865B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Chinese text classification method based on a classifier, which comprises a test set text D and a text category set C of a training set, wherein the test set text D is mapped to the text category set C of the training set through a text classification method, $D=\{d_1,d_2,\ldots,d_m\}$, $C=\{c_1,c_2,\ldots,c_n\}$, m is the number of texts, and n is the number of text categories. The method comprises 101) a text preprocessing step, 102) a classifier step, 103) a testing and evaluating step and 104) an adjusting step. The Chinese text classification method based on the classifier is more reasonable in modeling, improves classification accuracy and recall rate, and is accurate and fast as a whole.

Description

Chinese text classification method based on classifier
The invention relates to a Chinese text classification method; this application is a divisional application of the application No. 201910100095.7.
Technical Field
The invention relates to the field of text classification, in particular to a Chinese text classification method based on a classifier.
Background
In recent years, chemical accidents have occurred frequently. For example, the "8.12" fire and explosion at the Ruihai International Logistics dangerous-goods warehouse in the Tianjin Binhai New Area and the "11.22" oil pipeline leak and explosion in Huangdao, Qingdao, Shandong brought not only huge economic losses but also casualties and environmental pollution; some serious chemical accidents easily cause public panic and have a great influence on society. If reports related to such chemical accidents could be located quickly and accurately by some technology, it would facilitate investigating the causes of the accidents, tracking the reports, preventing further accidents, and so on. A technology for efficiently managing such information is therefore required, one that automatically classifies a large amount of text information and selects the texts of the specific field that people want. Text classification technology can analyze and process large amounts of text data, greatly reducing manual intervention; it can locate specific information texts efficiently and accurately and is an effective way of processing all kinds of texts.
Information technology is developing ever more rapidly, internet technology is maturing, and the amount of data generated is growing explosively; most of it is semi-structured or unstructured and presented in text form. If texts were classified manually, the classification results would be accurate, but the labor and material cost would be enormous; such an approach cannot keep pace with the extremely rapid growth of information in the internet era or the demands of social development, and is very difficult to realize. In fact, according to their specific needs, people often only care about text information in a certain field, and the rapid extraction of specified text information plays a significant role in the development of internet technology.
The earliest reported work on text classification in China dates to the early 1980s, when it was first systematically described by Professor Hou Hanqing in Nanjing. Subsequently many scholars continuously improved text classification methods, and research in this field made great progress in China. Li Xiaoli, Shi Zhongzhi and others greatly improved the accuracy and recall rate of text classification by introducing concept inference networks into it. Jiang Yuan, Zhou Zhihua and others proposed in 2006 that word frequency be used as an influence factor during classification; Li Rongliu of Fudan University adopted a classification method based on the maximum entropy model when constructing a text classifier; and Huang Jingjing and others broadly extended text classification with language-independent techniques. However, no method so far achieves extremely high classification accuracy overall. How to locate information quickly and accurately has been an important research topic in recent years.
Disclosure of Invention
The Chinese text classification method based on the classifier is more reasonable in modeling, higher in classification accuracy and recall rate, and is accurate and rapid as a whole.
The technical scheme of the invention is as follows:
A Chinese text classification method based on a classifier comprises a test set text D and a text category set C of a training set, wherein the test set text D is mapped to the text category set C of the training set through a text classification method, $D=\{d_1,d_2,\ldots,d_m\}$, $C=\{c_1,c_2,\ldots,c_n\}$, m is the number of texts, and n is the number of text categories. The specific processing steps are as follows:
101 Text preprocessing step: performing text marking processing, word segmentation and stop word removal on the text of the training set, performing feature selection on the processed text through statistics, and performing feature dimension reduction to obtain a text category set C of the training set;
wherein the statistics ranks feature items by the correlation between a feature item t and a category $C_i$, using four counts: A, the number of texts that belong to category $C_i$ and contain the feature item t; B, the number of texts that do not belong to $C_i$ but contain t; C, the number of texts that belong to $C_i$ but do not contain t; and D, the number of texts that neither belong to $C_i$ nor contain t. $C_i$ represents one category in the text category set obtained after word segmentation with duplicate segments removed, where i is a category index no greater than the number of segmented words; the feature item t is a specific segmented word;
the total number set of texts containing the feature item t in the training set is A + B, the total number set of texts containing no feature item t is C + D, and the category C i The text number set of (2) is A + C, the text number set of other categories is B + D, the total text number set of the training set is N, and N = A + B + C + D, the probability of the feature item t is represented as
Figure BDA0002700186230000031
From this, the relevance value of the feature item t and the class $C_i$ can be obtained:

$$X^2(t,c_i)=\frac{N\,(AD-CB)^2}{(A+B)(C+D)(A+C)(B+D)}$$
if the feature item t and the category $C_i$ are independent of each other, then AD-CB = 0 and $X^2(t,c_i)=0$; the larger the value of $X^2(t,c_i)$, the greater the degree of correlation between the feature item t and the class $C_i$. AD represents a quantized value of correctly judging from the feature item t that a document belongs to class $C_i$, and CB represents a quantized value of erroneously judging from t that a document belongs to class $C_i$;
the average value of the statistical ranking is used as comparison, and the average value is the following formula:
Figure BDA0002700186230000033
the statistical sorting is performed by sorting the average values from large to small, and a certain number of characteristic items are selected from the text category set C of the training set from large to small;
102) Classifier step: processing the data processed in step 101) by a text classifier, wherein the specific formula is as follows:

$$P(C_i \mid D_j)=\frac{P(C_i)\,P(x_1,x_2,\ldots,x_n \mid C_i)}{P(x_1,x_2,\ldots,x_n)} \qquad (3)$$
wherein $P(C_i \mid D_j)$ represents the probability that a text $D_j$ belongs to a certain class $C_i$, and the document $D_j$ may be represented by its set of segmented words, i.e. $D_j=\{x_1,x_2,\ldots,x_n\}$. Since the number of occurrences of a fixed feature word in the text set is constant, the denominator $P(x_1,x_2,\ldots,x_n)$ in formula (3) is constant, so only the numerator $P(C_i)P(x_1,x_2,\ldots,x_n \mid C_i)$ of formula (3) is needed to determine the relative magnitudes of the $P(C_i \mid D_j)$ values; therefore, formula (3) can finally be expressed as:
$$P(C_i \mid D_j) \propto P(C_i)\prod_{j=1}^{n}P(x_j \mid C_i) \qquad (4)$$
wherein $x_j$ is a feature item of the document $D_j$ and n is the number of feature items. When a feature item appears in the text, its weight is set to 1; if it does not appear, the weight is set to 0. The test text is regarded as an n-fold Bernoulli experiment, i.e. a random experiment repeated independently under the same conditions. Letting $B_{xt}$ indicate whether the test document contains the text feature item t, the following formula is obtained:

$$P(D_j \mid C_i)=\prod_{j=1}^{n}\Bigl[B_{xt}\,P(x_j \mid C_i)+(1-B_{xt})\bigl(1-P(x_j \mid C_i)\bigr)\Bigr]$$
where, for the category $C_i$, $P(x_j \mid C_i)$ is the probability that $x_j$ appears; if a feature item appears in the test text, only $P(x_j \mid C_i)$ needs to be obtained, otherwise $1-P(x_j \mid C_i)$ is obtained;
the conditional probability $P(x_j \mid C_i)$ is given by:

$$P(x_j \mid C_i)=\frac{n_{ij}}{N_i}$$

where $n_{ij}$ is the number of texts of class $C_i$ containing the feature item $x_j$ and $N_i$ is the total number of texts of class $C_i$;
in the training set, if class C i All texts in (1) do not contain the feature item x j Then n is ij Is 0, whereby P (x) j |C i ) The value of (2) is 0, so that a method of adding a smoothing factor needs to be adopted, and the following formula is obtained:
Figure BDA0002700186230000044
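The classifier of the formulas above can be sketched as a minimal Bernoulli naive Bayes implementation, assuming the +1/+2 Laplace smoothing reconstructed above; it works in log space to avoid numerical underflow, an implementation choice the patent does not specify.

```python
import math

def train(docs, labels, features):
    """Estimates P(C_i) = N_i / N and the smoothed
    P(x_j | C_i) = (n_ij + 1) / (N_i + 2) for every feature word."""
    N = len(docs)
    prior, cond = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        Ni = len(class_docs)
        prior[c] = Ni / N
        cond[c] = {t: (sum(1 for d in class_docs if t in d) + 1) / (Ni + 2)
                   for t in features}
    return prior, cond

def classify(doc, prior, cond, features):
    """argmax over classes of
    P(C_i) * prod_j [B_xt P(x_j|C_i) + (1 - B_xt)(1 - P(x_j|C_i))]."""
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for t in features:
            p = cond[c][t]
            score += math.log(p if t in doc else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best
```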
103) Testing and evaluating step: evaluating the accuracy, the recall rate, the F1 value and the macro-average of the classifier, and adjusting the text category set C of the training set.
Further, the text marking process removes the Chinese symbols, numbers and English in the text using regular expressions. The regular expression removing Chinese symbols can be expressed as [^\u4e00-\u9fa5\w], and the regular expression removing numbers and English is [a-zA-Z\d]; matches are replaced with a blank space.
Further, adopting an MMSEG4J word segmentation toolkit to perform word segmentation; the stop words are words which appear in the text for many times and are irrelevant to the text content, are sorted into a stop word list and are deleted after the word segmentation is finished.
Further, the accuracy, also called precision, measures how many texts in the test set are correctly classified; it reflects the exactness of the classifier's classification and is denoted P:

$$P=\frac{A}{A+B}$$
belong to class C i And the text number set A containing the feature item t, i.e. correctly classified to C i The number of texts of the class; not belonging to class C i But the text number set B containing the characteristic item t, A + B is actually classified into C i Total number of texts of class;
recall, also known as recall, and acquisition tests focused on category C i Can be correctly classified into the category C i The occupied proportion shows the completeness of the classification of the classifier, which is marked as R, and the specific formula is as follows:
Figure BDA0002700186230000051
belong to class C i But the text number sets C, A + C containing no feature item t, i.e. all should be classified as C i The text of the class;
the F1 value, also called the comprehensive classification rate, is a comprehensive evaluation index of the accuracy P and the recall rate R, and the specific formula is as follows:
Figure BDA0002700186230000052
the macro-average is the evaluation of the overall classification effect of the classifier, the arithmetic mean of the accuracy and the recall rate is the macro-average, and the specific formula is as follows:
Figure BDA0002700186230000053
Figure BDA0002700186230000054
wherein MacAvg_Precision represents the macro-average of precision, MacAvg_Recall represents the macro-average of recall, |C| represents the number of text categories contained in the training set, $P_i$ represents the precision of class $C_i$, and $R_i$ represents the recall of class $C_i$.
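As an illustration of this evaluation, the four indexes can be computed from true and predicted labels as follows; the data representation is an assumption made for the example.

```python
def evaluate(y_true, y_pred, classes):
    """Per-class precision P = A/(A+B), recall R = A/(A+C),
    F1 = 2PR/(P+R), and the macro-averages over |C| categories."""
    P, R = {}, {}
    for c in classes:
        A = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        B = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        C = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        P[c] = A / (A + B) if A + B else 0.0
        R[c] = A / (A + C) if A + C else 0.0
    f1 = {c: 2 * P[c] * R[c] / (P[c] + R[c]) if P[c] + R[c] else 0.0
          for c in classes}
    mac_p = sum(P.values()) / len(classes)   # MacAvg_Precision
    mac_r = sum(R.values()) / len(classes)   # MacAvg_Recall
    return P, R, f1, mac_p, mac_r
```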
Compared with the prior art, the invention has the advantages that: the method establishes the characteristic items through the text training set, and evaluates the characteristic items through indexes such as accuracy, recall rate, F1 value, macro-average and the like, thereby training and adjusting the selected characteristic items. According to the method, the relationship degree quantization values are obtained and sorted through the relevance values of the feature items and the categories, and the appropriate feature items are selected as the classification standards, so that the accuracy, the recall rate and the precision are improved. The scheme of the invention provides possibility for high efficiency of text classification, and has high classification accuracy, high recall rate and accurate and rapid whole.
Drawings
FIG. 1 is a diagram of an overall model of the present invention;
FIG. 2 is a diagram of a text classification mapping model according to the present invention;
FIG. 3 is the original text in the training set of the present invention;
FIG. 4 is the text of FIG. 3 after text markup processing in accordance with the present invention;
FIG. 5 is a diagram of the present invention after the word segmentation process of FIG. 4;
FIG. 6 is the text of FIG. 5 with stop word processing removed in accordance with the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
As shown in FIG. 1 to FIG. 6, a Chinese text classification method based on a classifier comprises a test set text D and a training set text category set C, and maps the test set text D to the training set text category set C by a text classification method, wherein $D=\{d_1,d_2,\ldots,d_m\}$, $C=\{c_1,c_2,\ldots,c_n\}$, m is the number of texts, and n is the number of text categories. The method specifically comprises the following steps:
101) Text preprocessing step: text marking processing, word segmentation and stop word removal are carried out on the texts of the training set; feature selection is then performed on the processed texts through statistics, and feature dimensionality reduction is performed to obtain the text category set C of the training set. The specific steps are as follows:
As shown in FIG. 3, the original texts in the training set contain special characters, numbers, etc. that carry no text information; they do not help the classification of the text and, being noise data, require text marking processing. Regular expressions are used to remove Chinese symbols, numbers and English: the regular expression removing Chinese symbols can be expressed as [^\u4e00-\u9fa5\w], and the regular expression removing numbers and English is [a-zA-Z\d]. The processed text is shown in FIG. 4. To avoid affecting the subsequent Chinese word segmentation, the removed symbols are replaced with blank spaces.
Apart from punctuation, Chinese text has no obvious separator marks, so the MMSEG4J word segmentation toolkit is adopted to divide the Chinese text into words, which is a key step in processing Chinese text information. The resulting word-segmented text is shown in FIG. 5.
Words that appear many times in a text but are irrelevant to its content are called stop words, such as the function words '啊' ('oh') and '但是' ('but'), real words without practical meaning, conjunctions, modal particles, prepositions, pronouns and the like. These words appear in almost every text; they are collected into a stop word list and deleted after the Chinese word segmentation is finished, and the resulting text is the text information after preprocessing. The stop word list may be taken directly from the Baidu stop word list. The text after stop word removal is shown in FIG. 6. A sketch of the whole preprocessing pipeline follows.
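A minimal Python sketch of this preprocessing pipeline, with jieba used only as a stand-in for the Java MMSEG4J toolkit named above and a hypothetical stop word file name:

```python
import re
import jieba  # stand-in segmenter; the patent uses the Java MMSEG4J toolkit

# Load a stop word list (e.g. the Baidu stop word list); the path is hypothetical.
with open("baidu_stopwords.txt", encoding="utf-8") as f:
    STOP_WORDS = {line.strip() for line in f if line.strip()}

def preprocess(text):
    # Text marking: replace everything except Chinese characters / word
    # characters, then letters and digits, with spaces (regexes as
    # reconstructed from the description above).
    text = re.sub(r"[^\u4e00-\u9fa5\w]", " ", text)
    text = re.sub(r"[a-zA-Z\d]", " ", text)
    # Word segmentation followed by stop word removal.
    return [w for w in jieba.cut(text) if w.strip() and w not in STOP_WORDS]
```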
Wherein the statistics ranks feature items by the correlation between a feature item t and a category $C_i$, using four counts: A, the number of texts that belong to category $C_i$ and contain the feature item t; B, the number of texts that do not belong to $C_i$ but contain t; C, the number of texts that belong to $C_i$ but do not contain t; and D, the number of texts that neither belong to $C_i$ nor contain t. $C_i$ represents one category in the text category set obtained after word segmentation with duplicate segments removed, where i is a category index no greater than the number of segmented words; the feature item t is a specific segmented word.
in the training set text total number N =806, a + b =394, in the chemical accident news report category, a =383, b =11, c =108, d =304, p (chemical industry) =0.609; in the non-chemical engineering accident news report category, a =11, b =383, c =304, d =108, and p (non-chemical engineering) =0.391 are cases.
The total number of texts containing the feature item t in the training set is A+B, the total number of texts not containing t is C+D, the number of texts of category $C_i$ is A+C, the number of texts of the other categories is B+D, and the total number of training texts is N = A+B+C+D; the probability of the feature item t is expressed as

$$P(t)=\frac{A+B}{N}$$
From this, the relevance value of the feature item t and the class $C_i$ can be obtained:

$$X^2(t,c_i)=\frac{N\,(AD-CB)^2}{(A+B)(C+D)(A+C)(B+D)}$$
If the feature item t and the category $C_i$ are independent of each other, then AD-CB = 0 and $X^2(t,c_i)=0$; the larger the value of $X^2(t,c_i)$, the greater the degree of correlation between the feature item t and the class $C_i$. AD represents a quantized value of correctly judging from the feature item t that a document belongs to class $C_i$, and CB represents a quantized value of erroneously judging from t that a document belongs to class $C_i$.
the average value of the statistical ranking is used as comparison, and the average value is the following formula:
Figure BDA0002700186230000081
statistical ranking the mean values were ranked from big to small, from trainingAnd selecting a certain number of characteristic items from large to small in the exercise set text category set C. The result of each characteristic item t to be obtained
Figure BDA0002700186230000082
And arranging the feature words from large to small according to a selection sorting algorithm, and if the number of the feature words to be selected is 50, only selecting the first 50 feature words arranged from large to small. It may happen that the 50 th and 51 th calculated results are the same, and at this time, the results need to be evaluated and adjusted accordingly, and even if the results are different, the final evaluation may occur later than the first 50 ranked feature words. And therefore needs to be adjusted according to the evaluation. The method comprises the following specific steps:
$$X^2(\text{消防},c_{\text{chemical}})=\frac{806\times(383\times304-108\times11)^2}{394\times412\times491\times315}\approx 426.37 \qquad (2\text{-}10)$$

$$X^2(\text{消防},c_{\text{non-chemical}})=\frac{806\times(11\times108-304\times383)^2}{394\times412\times315\times491}\approx 426.37 \qquad (2\text{-}11)$$

$$\overline{X^2}(\text{消防})=0.609\times426.37+0.391\times426.37\approx 426.37 \qquad (2\text{-}12)$$

As can be seen from equations (2-10) to (2-12), the averaged statistic $\overline{X^2}$ of '消防' (fire fighting) is 426.37. The other keywords are treated in the same way, so the data can be obtained and sorted, and the required number of feature words is selected as the feature items of the text category set of the training set.
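The 426.37 value can be checked directly from the counts given above (N = 806, A = 383, B = 11, C = 108, D = 304); a quick numerical check in Python:

```python
A, B, C, D = 383, 11, 108, 304
N = A + B + C + D                          # 806
chi2 = N * (A * D - C * B) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))
# With two categories the statistic is the same for both classes, so the
# prior-weighted average equals chi2 itself:
avg = 0.609 * chi2 + 0.391 * chi2
print(round(chi2, 2), round(avg, 2))       # 426.37 426.37
```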
102) Classifier step: the data processed in step 101) is processed by the text classifier. Taking as an example a news report text processed by the above step with 300 feature words selected: 128 words are obtained after the text is preprocessed, and 37 feature words remain in the article after the statistical processing, which greatly reduces the processing load and improves the processing accuracy. The specific formula is as follows:

$$P(C_i \mid D_j)=\frac{P(C_i)\,P(x_1,x_2,\ldots,x_n \mid C_i)}{P(x_1,x_2,\ldots,x_n)} \qquad (3)$$
wherein $P(C_i \mid D_j)$ represents the probability that a text $D_j$ belongs to a certain class $C_i$, and the document $D_j$ may be represented by its set of segmented words, i.e. $D_j=\{x_1,x_2,\ldots,x_n\}$. Since the number of occurrences of a fixed feature word in the text set is constant, the denominator $P(x_1,x_2,\ldots,x_n)$ in formula (3) is constant, so only the numerator $P(C_i)P(x_1,x_2,\ldots,x_n \mid C_i)$ of formula (3) is needed to determine the relative magnitudes of the $P(C_i \mid D_j)$ values.
Therefore, formula (3) can finally be expressed as:
$$P(C_i \mid D_j) \propto P(C_i)\prod_{j=1}^{n}P(x_j \mid C_i) \qquad (4)$$
wherein $x_j$ is a feature item of the document $D_j$ and n is the number of feature items. When a feature item appears in the text, its weight is set to 1; if it does not appear, the weight is set to 0. The test text is regarded as an n-fold Bernoulli experiment, i.e. a random experiment repeated independently under the same conditions.
Taking a case as an example: $P(C_i)$ is the prior, and

$$\prod_{j=1}^{n}P(x_j \mid C_i)$$

is the product of the conditional probabilities of all feature items in class $C_i$. The values C(chemical) and C(non-chemical) are calculated and compared; if C(chemical) > C(non-chemical), the test news report text belongs to the chemical accident news report category; otherwise it belongs to the non-chemical accident news category.
The prior probability of class $C_i$ can be expressed as:

$$P(C_i)=\frac{N_i}{N}$$

where $N_i$ is the number of training texts of class $C_i$ and N is the total number of training texts.
Letting $B_{xt}$ indicate whether the test document contains the text feature item t, the following formula is obtained:

$$P(D_j \mid C_i)=\prod_{j=1}^{n}\Bigl[B_{xt}\,P(x_j \mid C_i)+(1-B_{xt})\bigl(1-P(x_j \mid C_i)\bigr)\Bigr]$$
where, for the category $C_i$, $P(x_j \mid C_i)$ is the probability that $x_j$ appears; if a feature item appears in the test text, only $P(x_j \mid C_i)$ needs to be obtained, otherwise $1-P(x_j \mid C_i)$ is obtained;
the conditional probability $P(x_j \mid C_i)$ is given by:

$$P(x_j \mid C_i)=\frac{n_{ij}}{N_i}$$
in the training set, if class C i All texts in (1) do not contain the feature item x j Then n is ij Is 0, whereby P (x) j |C i ) The value of (2) is 0, so that a method of adding a smoothing factor needs to be adopted, and the following formula is obtained:
Figure BDA0002700186230000097
in the training set text total number N =806, a + b =394, in the chemical accident news report category, a =383, b =11, c =108, d =304, p (chemical industry) =0.609; in the category of non-chemical accident news reports, a =11, b =383, c =304, d =108, p (non-chemical) =0.391 is an example. Taking 806 training set texts, 491 chemical accident news reports and 315 non-chemical accident news reports as examples, in the category of the chemical accident news reports, P (chemical engineering) =491/806=0.609; in the category of chemical accident news reports,p (non-chemical) =315/806=0.391. Taking the news report of fig. 3 as an example, the words after text processing are shown in fig. 5, t i For all the words in figure 5 of the drawings,
Figure BDA0002700186230000101
the test news report text belongs to a chemical accident news report.
103) Testing and evaluating step: the classification performance of the text classifier is tested with the test set texts, and the accuracy, recall rate, comprehensive classification rate and macro-average of the text classifier are evaluated so as to improve the classification performance.
The accuracy, also called precision, measures how many texts in the test set are correctly classified; it reflects the exactness of the classifier's classification and is denoted P:

$$P=\frac{A}{A+B}$$
belong to class C i And the text number set A containing the feature item t, i.e. correctly classified to C i The number of texts of the class; not belonging to class C i But the text number set B containing the characteristic item t, A + B is actually classified into C i Total number of texts of class;
recall, also known as recall, and acquisition tests focused on category C i Can be correctly classified into the category C i The occupied proportion shows the completeness of the classification of the classifier, which is marked as R, and the specific formula is as follows:
Figure BDA0002700186230000103
belong to class C i But the text number sets C, A + C containing no feature item t, i.e. all should be classified as C i The text of the class;
the F1 value, also called the comprehensive classification rate, is a comprehensive evaluation index of the accuracy P and the recall rate R, and the specific formula is as follows:
Figure BDA0002700186230000104
the macro-average is the evaluation of the overall classification effect of the classifier, the arithmetic mean of the accuracy and the recall rate is the macro-average, and the specific formula is as follows:
Figure BDA0002700186230000111
Figure BDA0002700186230000112
wherein MacAvg_Precision represents the macro-average of precision, MacAvg_Recall represents the macro-average of recall, |C| represents the number of text categories contained in the training set, $P_i$ represents the precision of class $C_i$, and $R_i$ represents the recall of class $C_i$.
Taking the chemical training set texts as an example, experimental data comparing the common information gain method with the statistical method are as follows:

                                               Correctly classified   Misclassified   Accuracy
  Using this statistical word selection method          196                 9           95.5%
  Without any word selection method                     134                66           67%

Table 1: Classification accuracy with and without the statistical word selection method
[Table 2: chemical accident category test; the table data appear only as an image in the source]

[Table 3: non-chemical accident category test; the table data appear only as an image in the source]
As can be seen from the tables above, the classification accuracy using the statistical method is significantly higher than without it. For the chemical accident category, the number of selected feature words has almost no influence on classification accuracy for either the statistical method or the information gain feature selection method; the statistical method achieves the higher accuracy, reaching more than 98%, while the information gain method is slightly lower. For the non-chemical accident category, classification accuracy is high when the number of feature words is 300, 500 or 1000: the statistical method reaches more than 89%, while the information gain method, although reaching more than 70%, is strongly affected by the number of feature words, its accuracy improving as more feature words are selected.
Inspection of the training set texts shows that most texts of the chemical accident category involve leakage, fire, explosion, poisoning and the like, so the classification accuracy of the chemical accident category is high, whereas the texts of the non-chemical accident category include news in the IT, military, education, sports, finance and other fields, covering a wide range. Most of the misclassified texts in the non-chemical accident test set concern fire drills, chemical accident summaries and the like; their features resemble those of chemical accidents, so they are classified into the chemical accident category.
104) Adjusting step: the selected feature items are adjusted according to the evaluation results of step 103), and the test evaluation is carried out again until the best effect is achieved. The comparison data in the tables above are the processing results for the unadjusted feature words; the results after adjustment are higher.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and decorations can be made without departing from the spirit of the present invention, and these modifications and decorations should also be regarded as being within the scope of the present invention.

Claims (4)

1. A Chinese text classification method based on a classifier, characterized by comprising a test set text D and a text category set C of a training set, wherein the test set text D is mapped to the text category set C of the training set through a text classification method, $D=\{d_1,d_2,\ldots,d_m\}$, $C=\{c_1,c_2,\ldots,c_n\}$, m is the number of texts, and n is the number of text categories; the specific processing steps are as follows:
101) Text preprocessing step: performing text marking processing, word segmentation and stop word removal on the text of the training set, performing feature selection on the processed text through statistics, and performing feature dimension reduction to obtain a text category set C of the training set;
wherein the statistics ranks feature items by the correlation between a feature item t and a category $C_i$, using four counts: A, the number of texts that belong to category $C_i$ and contain the feature item t; B, the number of texts that do not belong to $C_i$ but contain t; C, the number of texts that belong to $C_i$ but do not contain t; and D, the number of texts that neither belong to $C_i$ nor contain t; $C_i$ represents one category in the text category set obtained after word segmentation with duplicate segments removed, where i is a category index no greater than the number of segmented words; the feature item t is a specific segmented word;
the total number set of texts containing the feature item t in the training set is A + B, the total number set of texts containing no feature item t is C + D, and the category C i The text number set of (2) is A + C, the text number set of other categories is B + D, the total text number set of the training set is N, and N = A + B + C + D, the probability of the feature item t is represented as
Figure FDA0004043420920000011
from this, the relevance value of the feature item t and the category $C_i$ can be obtained:

$$X^2(t,c_i)=\frac{N\,(AD-CB)^2}{(A+B)(C+D)(A+C)(B+D)}$$
if the feature item t and the category $C_i$ are independent of each other, then AD-CB = 0 and $X^2(t,c_i)=0$; the larger the value of $X^2(t,c_i)$, the greater the degree of correlation between the feature item t and the class $C_i$; AD represents a quantized value of correctly judging from the feature item t that a document belongs to class $C_i$, and CB represents a quantized value of erroneously judging from t that a document belongs to class $C_i$;
the average value of the statistical ranking is used as comparison, and the average value is the following formula:
Figure FDA0004043420920000013
the statistical ranking is ranked from large to small according to the average value, and a certain number of characteristic items are selected from the text category set C of the training set from large to small according to the statistical ranking; the result of each characteristic item t to be obtained
Figure FDA0004043420920000021
and the feature words are arranged from large to small according to a selection sorting algorithm; if the result calculated for the n-th selected feature word is the same as that for the (n+1)-th, the results need to be evaluated and correspondingly adjusted or exchanged; even when the results differ, a feature word ranked later may evaluate higher in the final evaluation than the first n feature words; the selection therefore needs to be adjusted according to the evaluation, as follows:
$$X^2(\text{消防},c_{\text{chemical}})=\frac{806\times(383\times304-108\times11)^2}{394\times412\times491\times315}\approx 426.37 \qquad (2\text{-}10)$$

$$X^2(\text{消防},c_{\text{non-chemical}})=\frac{806\times(11\times108-304\times383)^2}{394\times412\times315\times491}\approx 426.37 \qquad (2\text{-}11)$$

$$\overline{X^2}(\text{消防})=0.609\times426.37+0.391\times426.37\approx 426.37 \qquad (2\text{-}12)$$

as can be seen from equations (2-10) to (2-12), the averaged statistic $\overline{X^2}$ of '消防' (fire fighting) is 426.37; the other keywords are treated in the same way, so the data can be obtained and sorted, and the required number of feature words is selected as the feature items of the text category set of the training set;
102) Classifier step: processing the data processed in step 101) by a text classifier, wherein the specific formula is as follows:

$$P(C_i \mid D_j)=\frac{P(C_i)\,P(x_1,x_2,\ldots,x_n \mid C_i)}{P(x_1,x_2,\ldots,x_n)} \qquad (3)$$
the denominator P (x) in the formula (3) 1 ,x 2 ,…,x n ) Is constant, so that it is only necessary to obtain the molecule P (C) in the formula (3) i )P(x 1 ,x 2 ,…,x n |C i ) Can determine that when different j values are obtained, P (C) is different i |D j ) Magnitude relationship between values; therefore, the formula (3) can be finally expressed as:
$$P(C_i \mid D_j) \propto P(C_i)\prod_{j=1}^{n}P(x_j \mid C_i) \qquad (4)$$
when a feature item appears in the text, its weight is set to 1; if it does not appear, the weight is set to 0; the test text is regarded as an n-fold Bernoulli experiment, i.e. a random experiment repeated independently under the same conditions; letting $B_{xt}$ indicate whether the test document contains the text feature item t, the following formula is obtained:

$$P(D_j \mid C_i)=\prod_{j=1}^{n}\Bigl[B_{xt}\,P(x_j \mid C_i)+(1-B_{xt})\bigl(1-P(x_j \mid C_i)\bigr)\Bigr]$$
where, for the category $C_i$, $P(x_j \mid C_i)$ is the probability that $x_j$ appears; if a feature item appears in the test text, only $P(x_j \mid C_i)$ needs to be obtained, otherwise $1-P(x_j \mid C_i)$ is obtained;
the conditional probability $P(x_j \mid C_i)$ is given by:

$$P(x_j \mid C_i)=\frac{n_{ij}}{N_i}$$
in the training set, if class C i All texts in (1) do not contain the feature item x j Then n is ij Is 0, whereby P (x) j |C i ) Is 0, so the method of adding the smoothing factor needs to be adopted, and the following formula is obtained:
Figure FDA0004043420920000033
103) Testing and evaluating step: evaluating the accuracy, recall rate, F1 value and macro-average of the classifier, and adjusting the text category set C of the training set;
104) Adjusting step: adjusting the selected feature items according to the evaluation results of step 103), and carrying out the test evaluation again until a preset effect is achieved.
2. The Chinese text classification method based on a classifier as claimed in claim 1, wherein: the text marking process removes the Chinese symbols, numbers and English in the text using regular expressions; the regular expression removing Chinese symbols can be expressed as [^\u4e00-\u9fa5\w], and the regular expression removing numbers and English is [a-zA-Z\d]; matches are replaced with a blank space.
3. The Chinese text classification method based on classifier as claimed in claim 1, wherein: performing word segmentation by adopting an MMSEG4J word segmentation toolkit; the stop words are words which appear in the text for many times and are irrelevant to the text content, are sorted into a stop word list and are deleted after the word segmentation is finished.
4. The Chinese text classification method based on a classifier as claimed in claim 1, wherein:
the accuracy, also called precision, measures how many texts in the test set are correctly classified; it reflects the exactness of the classifier's classification and is denoted P:

$$P=\frac{A}{A+B}$$
belong to class C i And the text number set A containing the feature item t, namely correctly classified into C i The number of texts of the class; not belonging to class C i But the text number set B containing the characteristic item t, A + B is actually classified into C i Total number of texts of class;
recall, also known as recall, and acquisition tests focused on category C i Can be correctly classified into the category C i The occupied proportion shows the completeness of the classification of the classifier, which is marked as R, and the specific formula is as follows:
Figure FDA0004043420920000042
belong to class C i But the text number sets C, A + C containing no feature item t, i.e. all should be classified as C i The text of the class;
the F1 value, also called the comprehensive classification rate, is a comprehensive evaluation index of the accuracy P and the recall rate R, and the specific formula is as follows:
Figure FDA0004043420920000043
the macro-average is the evaluation of the overall classification effect of the classifier, the arithmetic mean of the accuracy and the recall rate is the macro-average, and the specific formula is as follows:
Figure FDA0004043420920000044
Figure FDA0004043420920000045
wherein MacAvg_Precision represents the macro-average of precision, MacAvg_Recall represents the macro-average of recall, |C| represents the number of text categories contained in the training set, $P_i$ represents the precision of class $C_i$, and $R_i$ represents the recall of class $C_i$.
CN202011019598.0A 2019-01-31 2019-01-31 Chinese text classification method based on classifier Active CN112256865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011019598.0A CN112256865B (en) 2019-01-31 2019-01-31 Chinese text classification method based on classifier

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910100095.7A CN109902173B (en) 2019-01-31 2019-01-31 Chinese text classification method
CN202011019598.0A CN112256865B (en) 2019-01-31 2019-01-31 Chinese text classification method based on classifier

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910100095.7A Division CN109902173B (en) 2019-01-31 2019-01-31 Chinese text classification method

Publications (2)

Publication Number Publication Date
CN112256865A CN112256865A (en) 2021-01-22
CN112256865B true CN112256865B (en) 2023-03-21

Family

ID=66944611

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202011019598.0A Active CN112256865B (en) 2019-01-31 2019-01-31 Chinese text classification method based on classifier
CN201910100095.7A Active CN109902173B (en) 2019-01-31 2019-01-31 Chinese text classification method

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910100095.7A Active CN109902173B (en) 2019-01-31 2019-01-31 Chinese text classification method

Country Status (1)

Country Link
CN (2) CN112256865B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798853A (en) * 2020-03-27 2020-10-20 北京京东尚科信息技术有限公司 Method, device, equipment and computer readable medium for speech recognition
CN112084308A (en) * 2020-09-16 2020-12-15 中国信息通信研究院 Method, system and storage medium for text type data recognition
CN112215002A (en) * 2020-11-02 2021-01-12 浙江大学 Electric power system text data classification method based on improved naive Bayes

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN108509471A (en) * 2017-05-19 2018-09-07 苏州纯青智能科技有限公司 A kind of Chinese Text Categorization
CN109165294A (en) * 2018-08-21 2019-01-08 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4713870B2 * 2004-10-13 2011-06-29 Hewlett-Packard Development Company, L.P. Document classification apparatus, method, and program
US8346534B2 (en) * 2008-11-06 2013-01-01 University of North Texas System Method, system and apparatus for automatic keyword extraction
CN101819601B (en) * 2010-05-11 2012-02-08 同方知网(北京)技术有限公司 Method for automatically classifying academic documents
CN104063399B (en) * 2013-03-22 2017-03-22 杭州娄文信息科技有限公司 Method and system for automatically identifying emotional probability borne by texts
CN105183831A (en) * 2015-08-31 2015-12-23 上海德唐数据科技有限公司 Text classification method for different subject topics

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512311A (en) * 2015-12-14 2016-04-20 北京工业大学 Chi square statistic based self-adaption feature selection method
CN108509471A (en) * 2017-05-19 2018-09-07 苏州纯青智能科技有限公司 A kind of Chinese Text Categorization
CN109165294A (en) * 2018-08-21 2019-01-08 安徽讯飞智能科技有限公司 Short text classification method based on Bayesian classification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"中文文本分类特征选择方法的研究与实现";林艳峰;《中国优秀硕士学位论文全文数据库 信息科技辑》;20160315;第I138-7803页 *
"基于朴素贝叶斯方法的中文文本分类研究";李丹;《中国优秀硕士学位论文全文数据库 信息科技辑》;20111115;第I138-519页 *

Also Published As

Publication number Publication date
CN109902173A (en) 2019-06-18
CN109902173B (en) 2020-10-27
CN112256865A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN111104794B (en) Text similarity matching method based on subject term
CN107992633B (en) Automatic electronic document classification method and system based on keyword features
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN103514183B (en) Information search method and system based on interactive document clustering
US8380714B2 (en) Method, computer system, and computer program for searching document data using search keyword
CN102411563B (en) Method, device and system for identifying target words
CN100401302C (en) Image meaning automatic marking method based on marking significance sequence
CN112256865B (en) Chinese text classification method based on classifier
CN107291723A (en) The method and apparatus of web page text classification, the method and apparatus of web page text identification
CN108197175B (en) Processing method and device of technical supervision data, storage medium and processor
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN110472203B (en) Article duplicate checking and detecting method, device, equipment and storage medium
CN101887415B (en) Automatic extraction method for text document theme word meaning
CN110287493B (en) Risk phrase identification method and device, electronic equipment and storage medium
CN112579783B (en) Short text clustering method based on Laplace atlas
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network
CN109871429B (en) Short text retrieval method integrating Wikipedia classification and explicit semantic features
CN116881799A (en) Method for classifying cigarette production data
CN110888977B (en) Text classification method, apparatus, computer device and storage medium
CN114511027B (en) Method for extracting English remote data through big data network
CN115712720A (en) Rainfall dynamic early warning method based on knowledge graph
CN114547233A (en) Data duplicate checking method and device and electronic equipment
CN110413782B (en) Automatic table theme classification method and device, computer equipment and storage medium
CN113326688A (en) Ideological and political theory word duplication checking processing method and device
CN116414939B (en) Article generation method based on multidimensional data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant