CN112256865B - Chinese text classification method based on classifier - Google Patents
- Publication number
- CN112256865B (application CN202011019598.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- category
- feature
- class
- item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a classifier-based Chinese text classification method that maps a test-set text D onto the text category set C of a training set, where D = {d_1, d_2, …, d_m}, C = {c_1, c_2, …, c_n}, m is the number of texts, and n is the number of text categories. The method comprises 101) a text preprocessing step, 102) a classifier step, 103) a testing and evaluation step, and 104) an adjustment step. The classifier-based Chinese text classification method models the task more reasonably, improves classification accuracy and recall, and is accurate and fast overall.
Description
The invention relates to a Chinese text classification method; this is a divisional application of the patent with application number 201910100095.7.
Technical Field
The invention relates to the field of text classification, in particular to a Chinese text classification method based on a classifier.
Background
In recent years chemical accidents have occurred frequently. The 8.12 fire-and-explosion accident at the Ruihai International Logistics dangerous-goods warehouse in the Tianjin Binhai New Area and the 11.22 oil-pipeline explosion and leakage accident in Qingdao, Shandong, caused not only huge economic losses but also casualties and environmental pollution; serious chemical accidents can easily cause public panic and have a great influence on society. If reports related to such chemical accidents could be located quickly and accurately, it would become easier to study their causes, follow up on reports, and prevent further accidents. A technology is therefore needed that manages such information efficiently, automatically classifying large amounts of text and selecting the domain-specific texts that people want. Text classification can analyze and process large amounts of text data, greatly reduces manual intervention, can locate specific texts efficiently and accurately, and is an effective way to process texts of all kinds.
Information technology is developing ever faster, Internet technology is maturing, and the amount of data generated is growing explosively; most of it is semi-structured or unstructured and presented as text. Classifying texts manually yields accurate results, but the manpower and material cost is enormous; manual classification cannot keep up with the extremely rapid growth of information in the Internet era or the demands of social development, and is very difficult to carry out. In practice people often care only about text information in a particular field, so rapid extraction of specified text information plays a significant role in the development of Internet technology.
The earliest domestic report on text classification dates to the early 1980s, when it was first described systematically by Professor Hou Han of Nanjing University. Many scholars have since improved text classification methods, and great progress has been made in this field in China. Li Xiaoli, Shi Zhongzhi and others greatly improved the precision and recall of text classification by introducing concept inference networks. Jiang Yuan, Zhou Zhihua and others proposed in 2006 to use word frequency as an influence factor during classification; Li Rongliu of Fudan University adopted a maximum-entropy classification method when constructing a text classifier; and Huang Jingjing and others used language-independent techniques to extend text classification broadly. Overall, however, no method achieves extremely high classification accuracy, and fast, accurate localization has been an important research topic in recent years.
Disclosure of Invention
The classifier-based Chinese text classification method of the invention models the task more reasonably, achieves higher classification accuracy and recall, and is accurate and fast overall.
The technical scheme of the invention is as follows:
A classifier-based Chinese text classification method comprises a test-set text D and a training-set text category set C, the test-set text D being mapped onto the training-set text category set C by the classification method, where D = {d_1, d_2, …, d_m}, C = {c_1, c_2, …, c_n}, m is the number of texts, and n is the number of text categories. The specific processing steps are as follows:
101 Text preprocessing step: performing text marking processing, word segmentation and stop word removal on the text of the training set, performing feature selection on the processed text through statistics, and performing feature dimension reduction to obtain a text category set C of the training set;
Here the statistics rank features by the correlation between a feature item t and a category C_i, using four counts: A, the number of texts that belong to category C_i and contain feature item t; B, the number of texts that do not belong to C_i but contain t; C, the number of texts that belong to C_i but do not contain t; and D, the number of texts that neither belong to C_i nor contain t. C_i denotes one category in the text category set obtained after word segmentation with similar tokens removed, i being a category index no greater than the number of segmented tokens; the feature item t is a specific segmented word.
The set of training texts containing feature item t has size A + B, the set not containing t has size C + D, category C_i contains A + C texts, the other categories contain B + D texts, and the total number of training texts is N = A + B + C + D. The probability of feature item t is then

P(t) = (A + B) / N

From this, the correlation value between feature item t and category C_i is obtained as:

X²(t, c_i) = N·(AD − CB)² / ((A + C)·(B + D)·(A + B)·(C + D))

If feature item t and category C_i are independent, then AD − CB = 0 and X²(t, c_i) = 0; the larger X²(t, c_i), the stronger the correlation between t and C_i. AD quantifies correct judgments that a document belongs to class C_i based on feature item t, while CB quantifies erroneous judgments that a document belongs to C_i based on t.
The average of the statistic over all n categories is used for comparison:

X²_avg(t) = (1/n) · Σ_{i=1}^{n} X²(t, c_i)

The averages are sorted from large to small, and a certain number of feature items are selected, largest first, for the training-set text category set C;
102) Classifier step: the data processed in step 101) are processed by a text classifier, with the specific formula:

P(C_i | D_j) = P(C_i) · P(x_1, x_2, …, x_n | C_i) / P(x_1, x_2, …, x_n)    (3)

where P(C_i | D_j) is the probability that training-set text D_j belongs to a category C_i, and document D_j can be represented by its set of tokens, D_j = {x_1, x_2, …, x_n}. Since the number of occurrences of a fixed feature word in the text set is constant, the denominator P(x_1, x_2, …, x_n) in formula (3) is constant, so only the numerator P(C_i)·P(x_1, x_2, …, x_n | C_i) is needed to determine, for different values of j, the relative magnitudes of P(C_i | D_j). Formula (3) can therefore finally be expressed as:

P(C_i | D_j) ∝ P(C_i) · ∏_{j=1}^{n} P(x_j | C_i)    (4)

where x_j is a feature item of document D_j and n is the number of feature items. When a feature item appears in a text its weight is set to 1, otherwise 0; the test text is treated as an n-fold Bernoulli trial, i.e. a random experiment repeated independently under identical conditions. Letting B_xt indicate that the test document contains text feature item t, the following formula is obtained:

P(D_j | C_i) = ∏_{j=1}^{n} [ B_xt · P(x_j | C_i) + (1 − B_xt) · (1 − P(x_j | C_i)) ]

In category C_i, the probability that x_j occurs is P(x_j | C_i); if a feature item appears in the test text, only P(x_j | C_i) is needed, otherwise 1 − P(x_j | C_i) is used. The conditional probability P(x_j | C_i) is given by:

P(x_j | C_i) = n_ij / N_i

where n_ij is the number of texts of class C_i containing feature item x_j and N_i is the number of texts of class C_i. In the training set, if no text of class C_i contains x_j, then n_ij = 0 and P(x_j | C_i) = 0, so a smoothing factor must be added, giving:

P(x_j | C_i) = (n_ij + 1) / (N_i + 2)
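The Bernoulli-event-model classifier of step 102) can be sketched as follows. This is a minimal illustration under assumptions: the toy corpus, class names, and vocabulary are invented, and add-one smoothing with the (n_ij + 1)/(N_i + 2) form stands in for the patent's unspecified smoothing factor.

```python
import math

def train(docs_by_class, vocab):
    """docs_by_class: {class: [set of feature items per doc]}.
    Returns, per class, the prior P(C_i) and the smoothed
    conditional P(x_j | C_i) = (n_ij + 1) / (N_i + 2)."""
    total = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        prior = len(docs) / total
        cond = {t: (sum(t in d for d in docs) + 1) / (len(docs) + 2)
                for t in vocab}
        model[c] = (prior, cond)
    return model

def classify(model, doc_terms):
    """argmax over classes of log P(C_i) plus, per feature item,
    log P(x_j|C_i) if present or log(1 - P(x_j|C_i)) if absent."""
    best, best_score = None, -math.inf
    for c, (prior, cond) in model.items():
        score = math.log(prior)
        for t, p in cond.items():
            score += math.log(p if t in doc_terms else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

vocab = {"fire", "explosion", "match"}
model = train({"chemical": [{"fire", "explosion"}, {"explosion"}],
               "other":    [{"match"}, {"match", "fire"}]}, vocab)
print(classify(model, {"explosion", "fire"}))  # → chemical
```

Working in log space avoids underflow when the product in formula (4) runs over many feature items.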
103 Testing and evaluation procedure): and evaluating the accuracy, the recall rate, the F1 value and the macro-average of the classifier, and adjusting the text category set C of the training set.
Further, the text marking process removes Chinese symbols, numbers and English from the text using regular expressions. The pattern that removes Chinese symbols (keeping Chinese characters and word characters) can be written as [^\u4e00-\u9fa5\w], and the pattern that removes numbers and English is [a-zA-Z\d]; matches are replaced with a space.
Further, the MMSEG4J word-segmentation toolkit is used for segmentation. Stop words are words that appear many times in a text but are unrelated to its content; they are collected into a stop-word list and deleted after segmentation.
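The two preprocessing stages (regex text marking, then stop-word filtering after segmentation) can be sketched in Python. MMSEG4J is a Java toolkit, so a hand-segmented token list stands in for its output here, and the stop-word entries are illustrative, not the patent's list.

```python
import re

# keep Chinese characters and word characters; everything else (Chinese
# punctuation etc.) becomes a space, then ASCII letters/digits are removed
KEEP_CJK = re.compile(r'[^\u4e00-\u9fa5\w]')
DROP_NUM_EN = re.compile(r'[a-zA-Z\d]')

def mark_text(text):
    """Text marking: replace symbols, numbers and English with spaces."""
    return DROP_NUM_EN.sub(' ', KEEP_CJK.sub(' ', text))

STOP_WORDS = {"的", "了", "和"}  # illustrative entries only

def remove_stop_words(tokens):
    """Drop stop words and empty tokens after segmentation."""
    return [t for t in tokens if t not in STOP_WORDS and t.strip()]

cleaned = mark_text("化工厂2020年发生fire事故！")
# tokens would come from a segmenter such as MMSEG4J; hand-segmented here
tokens = ["化工厂", "发生", "了", "事故"]
print(remove_stop_words(tokens))  # → ['化工厂', '发生', '事故']
```

Replacing matches with a space rather than deleting them mirrors the patent's note that bare deletion could merge adjacent words and disturb segmentation.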
Further, the accuracy, also called precision, measures how many texts in the test set are classified correctly and reflects the correctness of the classifier; it is denoted P:

P = A / (A + B)

Here A, the number of texts belonging to class C_i and containing feature item t, is the number of texts correctly classified into class C_i; B is the number of texts not belonging to C_i but containing t, so A + B is the total number of texts actually classified into class C_i.
The recall rate measures the proportion of test-set texts of category C_i that are correctly classified into C_i, reflecting the completeness of the classifier; it is denoted R:

R = A / (A + C)

C is the number of texts belonging to class C_i but not containing feature item t, so A + C is the total number of texts that should be classified into class C_i.
The F1 value, also called the comprehensive classification rate, is a combined evaluation index of precision P and recall R:

F1 = 2·P·R / (P + R)

The macro-average evaluates the overall classification effect of the classifier; it is the arithmetic mean of precision and recall over all classes:

MacAvg_Precision = (1/|C|) · Σ_{i=1}^{|C|} P_i
MacAvg_Recall = (1/|C|) · Σ_{i=1}^{|C|} R_i

where MacAvg_Precision is the macro-average of precision, MacAvg_Recall the macro-average of recall, |C| the number of text categories in the training set, P_i the precision of class C_i, and R_i the recall of class C_i.
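The four evaluation indices above can be sketched directly from their definitions. The per-class counts in the example at the bottom are illustrative, not the patent's experimental data.

```python
def precision(A, B):
    """P = A / (A + B): of texts assigned to class c, fraction correct."""
    return A / (A + B)

def recall(A, C):
    """R = A / (A + C): of texts truly in class c, fraction recovered."""
    return A / (A + C)

def f1(p, r):
    """Comprehensive classification rate: harmonic mean of P and R."""
    return 2 * p * r / (p + r)

def macro_average(per_class):
    """per_class: [(A, B, C) per class].
    Returns (MacAvg_Precision, MacAvg_Recall): arithmetic means over classes."""
    ps = [precision(a, b) for a, b, c in per_class]
    rs = [recall(a, c) for a, b, c in per_class]
    n = len(per_class)
    return sum(ps) / n, sum(rs) / n

# two hypothetical classes: (correct, wrongly assigned, missed)
mp, mr = macro_average([(90, 10, 5), (40, 5, 10)])
print(round(mp, 3), round(mr, 3))  # → 0.894 0.874
```

Note that the macro-average weights every class equally regardless of size, which is why the patent reports it alongside per-class precision and recall.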
Compared with the prior art, the invention establishes feature items from the text training set and evaluates them with indices such as precision, recall, F1 value and macro-average, so that the selected feature items can be trained and adjusted. The method quantifies and ranks the correlation between feature items and categories and selects suitable feature items as the classification standard, improving precision and recall. The scheme makes highly efficient text classification possible: classification accuracy and recall are high, and the whole process is accurate and fast.
Drawings
FIG. 1 is a diagram of an overall model of the present invention;
FIG. 2 is a diagram of a text classification mapping model according to the present invention;
FIG. 3 is the original text in the training set of the present invention;
FIG. 4 is the text of FIG. 3 after text markup processing in accordance with the present invention;
FIG. 5 is a diagram of the present invention after the word segmentation process of FIG. 4;
FIG. 6 is the text of FIG. 5 with stop word processing removed in accordance with the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
As shown in figs. 1 to 6, a classifier-based Chinese text classification method comprises a test-set text D and a training-set text category set C, and maps the test-set text D onto the training-set text category set C by the classification method, where D = {d_1, d_2, …, d_m}, C = {c_1, c_2, …, c_n}, m is the number of texts, and n is the number of text categories. The specific steps are as follows:
101) Text preprocessing step: perform text marking, word segmentation and stop-word removal on the training-set texts, then perform feature selection on the processed texts by statistics and reduce the feature dimensionality to obtain the training-set text category set C. The specific steps are as follows:
As shown in fig. 3, the original texts in the training set contain special characters, numbers and the like that carry no textual information and do not help classification; such noise data requires text marking, and regular expressions are used to remove Chinese symbols, numbers and English. The pattern removing Chinese symbols can be written as [^\u4e00-\u9fa5\w], and the pattern removing numbers and English as [a-zA-Z\d], yielding the processed text shown in fig. 4. To avoid disturbing Chinese word segmentation after symbols are removed, each match is replaced with a space.
Apart from punctuation, Chinese text has no explicit separator marks, so the MMSEG4J word-segmentation toolkit is used to divide the text into words, a key step in processing Chinese text information. The segmented text is shown in fig. 5.
Words that appear many times in a text but are unrelated to its content are called stop words: function words such as "ah" and "but", content-free words, conjunctions, modal particles, prepositions, pronouns and so on. They appear in almost every text and are collected into a stop-word list and deleted after Chinese word segmentation; what remains is the text information after preprocessing. The stop-word list can be taken directly from the Baidu stop-word list. The text after stop-word removal is shown in fig. 6.
Here the statistics rank features by the correlation between a feature item t and a category C_i, using four counts: A, the number of texts that belong to category C_i and contain feature item t; B, the number of texts that do not belong to C_i but contain t; C, the number of texts that belong to C_i but do not contain t; and D, the number of texts that neither belong to C_i nor contain t. C_i denotes one category in the text category set obtained after word segmentation with similar tokens removed, i being a category index no greater than the number of segmented tokens; the feature item t is a specific segmented word.
For example, with a training-set total of N = 806 texts and A + B = 394: in the chemical-accident news category, A = 383, B = 11, C = 108, D = 304, and P(chemical) = 0.609; in the non-chemical-accident news category, A = 11, B = 383, C = 304, D = 108, and P(non-chemical) = 0.391.
The set of training texts containing feature item t has size A + B, the set not containing t has size C + D, category C_i contains A + C texts, the other categories contain B + D texts, and the total number of training texts is N = A + B + C + D. The probability of feature item t is then

P(t) = (A + B) / N

From this, the correlation value between feature item t and category C_i is obtained as:

X²(t, c_i) = N·(AD − CB)² / ((A + C)·(B + D)·(A + B)·(C + D))

If feature item t and category C_i are independent, then AD − CB = 0 and X²(t, c_i) = 0; the larger X²(t, c_i), the stronger the correlation between t and C_i. AD quantifies correct judgments that a document belongs to class C_i based on feature item t, while CB quantifies erroneous judgments that a document belongs to C_i based on t.
The average of the statistic over all n categories is used for comparison:

X²_avg(t) = (1/n) · Σ_{i=1}^{n} X²(t, c_i)

The averages are sorted from large to small, and a certain number of feature items are selected, largest first, for the training-set text category set C. The X² result obtained for each feature item t is arranged in descending order by a selection-sort algorithm; if 50 feature words are to be selected, only the first 50 in descending order are kept. The 50th and 51st computed results may happen to be equal, in which case the results must be evaluated and the selection adjusted accordingly; even when the results differ, later evaluation may rank a word above the first 50. The selection therefore needs to be adjusted according to the evaluation. The specific steps are as follows:
As can be seen from equations (2-10) to (2-12), the X² statistic of "消防" (fire-fighting) is 426.37. The same computation applies to the other keywords, so the data can be obtained and ranked, and the required number of feature words selected as the feature items of the training-set text category set.
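The quoted value for "消防" (fire-fighting) can be checked directly from the counts given in the worked example above (N = 806, A = 383, B = 11, C = 108, D = 304):

```python
# contingency counts for the feature "消防" from the chemical-accident example
N, A, B, C, D = 806, 383, 11, 108, 304

# X^2(t, c_i) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))
chi2 = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
print(round(chi2, 2))  # → 426.37
```

The match with the patent's figure of 426.37 confirms that the reconstructed X² formula is the one the example was computed with.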
102) Classifier step: the data processed in step 101) are processed by the text classifier. For example, when 300 feature words are selected after processing a news-report text as above, a text that contains 128 words after preprocessing retains only 37 feature words after the statistical processing, which greatly reduces the processing load and improves accuracy. The specific formula is:

P(C_i | D_j) = P(C_i) · P(x_1, x_2, …, x_n | C_i) / P(x_1, x_2, …, x_n)    (3)

where P(C_i | D_j) is the probability that training-set text D_j belongs to a category C_i, and document D_j can be represented by its set of tokens, D_j = {x_1, x_2, …, x_n}. Since the number of occurrences of a fixed feature word in the text set is constant, the denominator P(x_1, x_2, …, x_n) in formula (3) is constant, so only the numerator P(C_i)·P(x_1, x_2, …, x_n | C_i) is needed to determine, for different values of j, the relative magnitudes of P(C_i | D_j).
Formula (3) can therefore finally be expressed as:

P(C_i | D_j) ∝ P(C_i) · ∏_{j=1}^{n} P(x_j | C_i)    (4)

where x_j is a feature item of document D_j and n is the number of feature items. When a feature item appears in a text its weight is set to 1, otherwise 0; the test text is treated as an n-fold Bernoulli trial, i.e. a random experiment repeated independently under identical conditions.
In a concrete case: P(C_i) is the prior, and the product of the conditional probabilities of all feature items in class C_i is computed; the values for C(chemical) and C(non-chemical) are calculated and compared. If C(chemical) > C(non-chemical), the test news-report text belongs to the chemical-accident news category; otherwise it belongs to the non-chemical-accident news category.
The prior probability of class C_i can be expressed as:

P(C_i) = N_i / N

where N_i is the number of training texts in class C_i and N is the total number of training texts. Letting B_xt indicate that the test document contains text feature item t, the following formula is obtained:

P(D_j | C_i) = ∏_{j=1}^{n} [ B_xt · P(x_j | C_i) + (1 − B_xt) · (1 − P(x_j | C_i)) ]
In category C_i, the probability that x_j occurs is P(x_j | C_i); if a feature item appears in the test text, only P(x_j | C_i) is needed, otherwise 1 − P(x_j | C_i) is used.
The conditional probability P(x_j | C_i) is given by:

P(x_j | C_i) = n_ij / N_i

where n_ij is the number of texts of class C_i that contain feature item x_j and N_i is the number of texts of class C_i. In the training set, if no text of class C_i contains feature item x_j, then n_ij = 0 and hence P(x_j | C_i) = 0, so a smoothing factor must be added, giving:

P(x_j | C_i) = (n_ij + 1) / (N_i + 2)
Taking 806 training-set texts, 491 chemical-accident news reports and 315 non-chemical-accident news reports as an example: in the chemical-accident news category, P(chemical) = 491/806 = 0.609; in the non-chemical-accident news category, P(non-chemical) = 315/806 = 0.391. Taking the news report of fig. 3 as an example, the words after text processing are shown in fig. 5; with t_i ranging over all the words in fig. 5, the computation shows that the test news-report text belongs to the chemical-accident news category.
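The two priors in this example follow directly from the class counts:

```python
# training-set sizes from the example: 491 chemical, 315 non-chemical
total, chemical, non_chemical = 806, 491, 315

p_chem = chemical / total        # P(chemical) = N_i / N
p_non = non_chemical / total     # P(non-chemical)
print(round(p_chem, 3), round(p_non, 3))  # → 0.609 0.391
```

These priors then multiply the per-feature conditional probabilities of formula (4) when the two class scores are compared.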
103) Testing and evaluation step: test the classification performance of the text classifier with the test-set texts, evaluate its precision, recall, comprehensive classification rate (F1) and macro-average, and improve the classification performance.
The accuracy, also called precision, measures how many texts in the test set are classified correctly and reflects the correctness of the classifier; it is denoted P:

P = A / (A + B)

Here A, the number of texts belonging to class C_i and containing feature item t, is the number of texts correctly classified into class C_i; B is the number of texts not belonging to C_i but containing t, so A + B is the total number of texts actually classified into class C_i.
The recall rate measures the proportion of test-set texts of category C_i that are correctly classified into C_i, reflecting the completeness of the classifier; it is denoted R:

R = A / (A + C)

C is the number of texts belonging to class C_i but not containing feature item t, so A + C is the total number of texts that should be classified into class C_i.
The F1 value, also called the comprehensive classification rate, is a combined evaluation index of precision P and recall R:

F1 = 2·P·R / (P + R)

The macro-average evaluates the overall classification effect of the classifier; it is the arithmetic mean of precision and recall over all classes:

MacAvg_Precision = (1/|C|) · Σ_{i=1}^{|C|} P_i
MacAvg_Recall = (1/|C|) · Σ_{i=1}^{|C|} R_i

where MacAvg_Precision is the macro-average of precision, MacAvg_Recall the macro-average of recall, |C| the number of text categories in the training set, P_i the precision of class C_i, and R_i the recall of class C_i.
Taking the chemical training-set texts as an example, the experimental data comparing the common information-gain method with the statistical method are as follows:

| | Correctly classified texts | Misclassified texts | Accuracy |
| Using this statistical word-selection method | 196 | 9 | 95.5% |
| Not using any word-selection method | 134 | 66 | 67% |

Table 1. Comparison of results with and without the statistical word selection
Table 2. Chemical-accident category test
Table 3. Non-chemical-accident category test
As the tables show, classification accuracy with the statistical method is markedly higher than without it. For the chemical-accident category, the number of selected feature words has almost no influence on accuracy for either the statistical method or the information-gain feature-selection method; the statistical method is more accurate, reaching over 98%, while the information-gain method is slightly less accurate. For the non-chemical-accident category, accuracy is high with 300, 500 and 1000 feature words: the statistical method reaches over 89%, while information gain, although reaching over 70%, is more strongly affected by the number of feature words, with accuracy improving as more feature words are used.
Checking the training-set texts shows that most chemical-accident texts involve leakage, fire, explosion, poisoning and the like, so classification accuracy for the chemical-accident category is high, whereas the non-chemical-accident texts cover news in IT, military, education, sports, finance and other fields, spanning a wide range. Most misclassified non-chemical-accident test texts concern fire drills, chemical-accident summaries and the like, whose features resemble those of chemical accidents, so they are classified into the chemical-accident category.
104) Adjustment step: adjust the selected feature items according to the evaluation results of step 103), then test and evaluate again until the best effect is achieved. The comparison data in the statistics table are the results for unadjusted feature words; the results after adjustment are higher.
The foregoing is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several modifications and refinements without departing from the spirit of the invention, and these should also be regarded as within the scope of protection of the invention.
Claims (4)
1. A classifier-based Chinese text classification method, characterized by comprising a test-set text D and a training-set text category set C, the test-set text D being mapped onto the training-set text category set C by the classification method, where D = {d_1, d_2, …, d_m}, C = {c_1, c_2, …, c_n}, m is the number of texts, and n is the number of text categories; the specific processing steps are as follows:
101 Text preprocessing step: performing text marking processing, word segmentation and stop word removal on the text of the training set, performing feature selection on the processed text through statistics, and performing feature dimension reduction to obtain a text category set C of the training set;
wherein, the statistics adopt the characteristic item t and the category C i The correlation of (2) is subjected to ranking statistics, and specifically comprises four statistics: belong to class C i And the text number set A containing the characteristic item t does not belong to the category C i But the text number set B containing the feature item t belongs to the category C i But the text number set C containing no feature item t does not belong to the category C i And does not contain the text number set D of the characteristic item t; c i In a text category set in which similar participles are removed after word segmentationI is a category mark, which is less than or equal to the number of participles after the participle; the characteristic item t is a specific word segmentation;
the total number set of texts containing the feature item t in the training set is A + B, the total number set of texts containing no feature item t is C + D, and the category C i The text number set of (2) is A + C, the text number set of other categories is B + D, the total text number set of the training set is N, and N = A + B + C + D, the probability of the feature item t is represented as
From this, the feature item t and the category C can be obtained i The relevance value of (a) is:
if the feature item t and the category C i Independently of one another, AD-CB =0, with X 2 (t,c i ) =0; if X is 2 (t,c i ) The larger the value of (D), the more the feature item t and the class C are indicated i The greater the degree of correlation; AD represents that the document is correctly judged to belong to C according to the characteristic item t i A quantized value of class, CB, indicating that the document belongs to C based on the error determination of the characteristic item t i A quantized value of a class;
for comparison across categories, the average value of the statistic is used:

X²_avg(t) = Σᵢ₌₁^|C| P(Cᵢ) · X²(t, Cᵢ)

where P(Cᵢ) is the proportion of training-set texts belonging to category Cᵢ.
The feature items are ranked from large to small by this average value, and a certain number of feature items are selected from the top of the ranking for the text category set C of the training set. The computed X²_avg(t) results for the feature words are arranged from large to small with a selection-sort algorithm; when the result for the nth feature word is the same as the result of the (n + 1)th calculation, the two results must be evaluated and, where appropriate, adjusted and exchanged, and even when the results differ, a feature word whose final evaluation is higher may outrank the first n feature words; the evaluation therefore needs to be adjusted as follows:
as can be seen from equations (2-10) to (2-12), the value for the keyword "fire-fighting" is 426.37; the other keywords are computed in the same way, so the data can be obtained and ranked, and the required number of feature words are selected as the feature items of the text category set of the training set;
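The ranking-and-selection described above can be sketched in Python as follows; Python's built-in sort stands in for the selection sort of the claim, the class-weighted average is used as the comparison value, and all names as well as the tie-breaking rule are illustrative:

```python
# Rank feature items by class-weighted average chi-square and keep the top k.
# scores[t][i] holds X^2(t, C_i); class_prob[i] holds P(C_i).
def select_features(scores, class_prob, k):
    avg = {t: sum(class_prob[i] * x2 for i, x2 in per_class.items())
           for t, per_class in scores.items()}
    # Sort descending by average; equal averages fall back to the term itself
    # so the ordering is deterministic (a stand-in for the claim's tie handling).
    ranked = sorted(avg, key=lambda t: (-avg[t], t))
    return ranked[:k]
```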
102) Classifier step: the data processed in step 101) are processed by a text classifier, the specific formula being:

P(Cᵢ | Dⱼ) = P(Cᵢ) P(x₁, x₂, …, xₙ | Cᵢ) / P(x₁, x₂, …, xₙ)   (3)

The denominator P(x₁, x₂, …, xₙ) in formula (3) is a constant, so only the numerator P(Cᵢ) P(x₁, x₂, …, xₙ | Cᵢ) is needed to determine the magnitude relationship between the P(Cᵢ | Dⱼ) values for the different categories of a test document Dⱼ; with conditionally independent features, formula (3) can therefore finally be expressed as:

P(Cᵢ | Dⱼ) ∝ P(Cᵢ) ∏ⱼ₌₁ⁿ P(xⱼ | Cᵢ)
when a feature item appears in the text its weight is set to 1, and to 0 if it does not appear; the test text is treated as an n-fold Bernoulli experiment, i.e. a random event repeated independently under the same conditions. With B(xⱼ) = 1 denoting that the test document contains the text feature item xⱼ and B(xⱼ) = 0 otherwise, the following formula is obtained:

P(Dⱼ | Cᵢ) = ∏ⱼ₌₁ⁿ [ B(xⱼ) P(xⱼ | Cᵢ) + (1 − B(xⱼ)) (1 − P(xⱼ | Cᵢ)) ]
P(xⱼ | Cᵢ) is the probability that xⱼ occurs in a text of category Cᵢ, meaning that if the feature item appears in the test text only P(xⱼ | Cᵢ) needs to be obtained, and otherwise 1 − P(xⱼ | Cᵢ); the conditional probability 1 − P(xⱼ | Cᵢ) is given by:

1 − P(xⱼ | Cᵢ) = 1 − nᵢⱼ / nᵢ = (nᵢ − nᵢⱼ) / nᵢ

where nᵢⱼ is the number of texts of category Cᵢ that contain the feature item xⱼ and nᵢ is the total number of texts of category Cᵢ;
in the training set, if none of the texts of category Cᵢ contains the feature item xⱼ, then nᵢⱼ is 0 and hence P(xⱼ | Cᵢ) is 0, so a smoothing factor must be added, giving:

P(xⱼ | Cᵢ) = (nᵢⱼ + 1) / (nᵢ + 2)
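A minimal Python sketch of this Bernoulli scoring with add-one smoothing, worked in log space to avoid underflow; all variable names and the data layout are illustrative, not from the patent:

```python
import math

# Bernoulli naive Bayes: score = log P(C_i) + sum over the vocabulary of
# log P(x_j|C_i) if x_j occurs in the test text, else log(1 - P(x_j|C_i)),
# with P(x_j|C_i) = (n_ij + 1) / (n_i + 2) as in the smoothing formula above.
def classify(text_terms, vocab, class_docs, n_ij, prior):
    best_class, best_score = None, float('-inf')
    for c, n_i in class_docs.items():          # n_i: number of class-c texts
        score = math.log(prior[c])
        for t in vocab:
            p = (n_ij[c].get(t, 0) + 1) / (n_i + 2)  # smoothed P(x_j | C_i)
            score += math.log(p if t in text_terms else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```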
103) Test and evaluation step: evaluate the accuracy, recall, F1 value and macro-average of the classifier, and adjust the text category set C of the training set;
104) Adjustment step: adjust the selected feature items according to the evaluation result of step 103), and repeat the test and evaluation until a preset effect is achieved.
2. The Chinese text classification method based on classifier as claimed in claim 1, wherein: the text marking process removes Chinese punctuation, numbers and English from the text using regular expressions; the regular expression removing Chinese punctuation can be expressed as [^\u4e00-\u9fa5\w], the regular expression excluding numbers and English is [a-zA-Z\d], and the matches are replaced with a space.
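The marking step of claim 2 can be sketched with Python's re module; the character classes below are reconstructed from the claim (the negation in the first class is an assumption, since replacing matches of a plain [\u4e00-\u9fa5\w] class would delete the Chinese text itself) and the function name is illustrative:

```python
import re

# Strip Chinese punctuation, then digits and English letters, replacing
# both with a space, as described in the claim.
def mark_text(text):
    text = re.sub(r'[^\u4e00-\u9fa5\w]', ' ', text)  # non-Chinese, non-word chars
    text = re.sub(r'[a-zA-Z\d]', ' ', text)          # English letters and digits
    return text
```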
3. The Chinese text classification method based on classifier as claimed in claim 1, wherein: word segmentation is performed with the MMSEG4J word segmentation toolkit; stop words are words that occur many times in the text but are irrelevant to its content, are collected into a stop-word list, and are deleted after word segmentation is finished.
4. The Chinese text classification method based on classifier as claimed in claim 1, wherein the test and evaluation step comprises:
accuracy, also called precision, measures how many texts in the test set are classified correctly and reflects the exactness of the classifier; it is denoted P and given by:

P = A / (A + B)

where A, the number of texts belonging to category Cᵢ and containing the feature item t, is the number of texts correctly classified into class Cᵢ; B, the number of texts not belonging to Cᵢ but containing t, counts texts wrongly assigned to Cᵢ; and A + B is the total number of texts actually classified into class Cᵢ;
recall measures the proportion of the texts of category Cᵢ in the test set that are correctly classified into Cᵢ and reflects the completeness of the classifier; it is denoted R and given by:

R = A / (A + C)

where C, the number of texts belonging to Cᵢ but not containing t, counts texts of Cᵢ that were missed, so that A + C is the number of all texts that should be classified as Cᵢ;
the F1 value, also called the comprehensive classification rate, is a combined evaluation index of the precision P and the recall R:

F1 = 2PR / (P + R)
the macro-average evaluates the overall classification effect of the classifier and is the arithmetic mean of the per-class precision and recall:

MacAvg_Precision = (1/|C|) Σᵢ₌₁^|C| Pᵢ
MacAvg_Recall = (1/|C|) Σᵢ₌₁^|C| Rᵢ

where MacAvg_Precision is the macro-average of precision, MacAvg_Recall is the macro-average of recall, |C| is the number of text categories contained in the training set, Pᵢ is the precision of category Cᵢ and Rᵢ is the recall of category Cᵢ.
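The four metrics of claim 4 can be sketched together in Python; the confusion-count layout and names are illustrative, with each class mapped to the claim's A (correctly classified), B (wrongly assigned) and C (missed) counts:

```python
# Per-class precision, recall and F1, plus the macro-averages.
# confusion maps class -> (correct, wrongly_assigned, missed).
def evaluate(confusion):
    P, R = {}, {}
    for c, (a, b, missed) in confusion.items():
        P[c] = a / (a + b) if a + b else 0.0          # P = A / (A + B)
        R[c] = a / (a + missed) if a + missed else 0.0  # R = A / (A + C)
    f1 = {c: 2 * P[c] * R[c] / (P[c] + R[c]) if P[c] + R[c] else 0.0
          for c in confusion}
    macro_p = sum(P.values()) / len(confusion)        # MacAvg_Precision
    macro_r = sum(R.values()) / len(confusion)        # MacAvg_Recall
    return P, R, f1, macro_p, macro_r
```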
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011019598.0A CN112256865B (en) | 2019-01-31 | 2019-01-31 | Chinese text classification method based on classifier |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910100095.7A CN109902173B (en) | 2019-01-31 | 2019-01-31 | Chinese text classification method |
CN202011019598.0A CN112256865B (en) | 2019-01-31 | 2019-01-31 | Chinese text classification method based on classifier |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910100095.7A Division CN109902173B (en) | 2019-01-31 | 2019-01-31 | Chinese text classification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112256865A CN112256865A (en) | 2021-01-22 |
CN112256865B true CN112256865B (en) | 2023-03-21 |
Family
ID=66944611
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011019598.0A Active CN112256865B (en) | 2019-01-31 | 2019-01-31 | Chinese text classification method based on classifier |
CN201910100095.7A Active CN109902173B (en) | 2019-01-31 | 2019-01-31 | Chinese text classification method |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN112256865B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111798853A (en) * | 2020-03-27 | 2020-10-20 | 北京京东尚科信息技术有限公司 | Method, device, equipment and computer readable medium for speech recognition |
CN112084308A (en) * | 2020-09-16 | 2020-12-15 | 中国信息通信研究院 | Method, system and storage medium for text type data recognition |
CN112215002A (en) * | 2020-11-02 | 2021-01-12 | 浙江大学 | Electric power system text data classification method based on improved naive Bayes |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105512311A (en) * | 2015-12-14 | 2016-04-20 | 北京工业大学 | Chi square statistic based self-adaption feature selection method |
CN108509471A (en) * | 2017-05-19 | 2018-09-07 | 苏州纯青智能科技有限公司 | A kind of Chinese Text Categorization |
CN109165294A (en) * | 2018-08-21 | 2019-01-08 | 安徽讯飞智能科技有限公司 | Short text classification method based on Bayesian classification |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4713870B2 (en) * | 2004-10-13 | 2011-06-29 | ヒューレット−パッカード デベロップメント カンパニー エル.ピー. | Document classification apparatus, method, and program |
US8346534B2 (en) * | 2008-11-06 | 2013-01-01 | University of North Texas System | Method, system and apparatus for automatic keyword extraction |
CN101819601B (en) * | 2010-05-11 | 2012-02-08 | 同方知网(北京)技术有限公司 | Method for automatically classifying academic documents |
CN104063399B (en) * | 2013-03-22 | 2017-03-22 | 杭州娄文信息科技有限公司 | Method and system for automatically identifying emotional probability borne by texts |
CN105183831A (en) * | 2015-08-31 | 2015-12-23 | 上海德唐数据科技有限公司 | Text classification method for different subject topics |
Non-Patent Citations (2)
Title |
---|
"中文文本分类特征选择方法的研究与实现";林艳峰;《中国优秀硕士学位论文全文数据库 信息科技辑》;20160315;第I138-7803页 * |
"基于朴素贝叶斯方法的中文文本分类研究";李丹;《中国优秀硕士学位论文全文数据库 信息科技辑》;20111115;第I138-519页 * |
Also Published As
Publication number | Publication date |
---|---|
CN109902173A (en) | 2019-06-18 |
CN109902173B (en) | 2020-10-27 |
CN112256865A (en) | 2021-01-22 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||