CN108959237B - Text classification method, device, medium and equipment - Google Patents
- Publication number
- CN108959237B (application CN201710370523.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- category
- word
- classified
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data
- G06F40/216—Parsing using statistical methods
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a text classification method, device, medium and equipment. The method comprises: determining, for each key feature word in the text to be classified, the word category corresponding to that key feature word according to the sample feature words corresponding to the word categories in a sample library; determining the text categories corresponding to the sample texts containing the key feature word according to the correspondence between text categories and sample texts in the sample library, and taking the determined text categories as the text categories corresponding to the key feature word; determining the weight of the key feature word under each corresponding text category, wherein the weight of the key feature word under any corresponding text category is the sum of the weights, under that text category, of the sample feature words belonging to the same word category as the key feature word; and determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category. The method and device can improve the accuracy with which the category of the text to be classified is determined.
Description
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text classification method, apparatus, medium, and device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
A commonly used text classification method works as follows: feature words are extracted from the text to be classified using a chi-square test algorithm; for each feature word, the text categories corresponding to that feature word are looked up in the sample library's correspondence between text categories and sample feature words; the probability of the feature word occurring within a text category is taken as the feature word's weight under that category; and the category of the text to be classified is determined from the weights of the feature words under their corresponding text categories, where a larger total weight of the feature words under a category indicates a higher likelihood that the text belongs to that category.
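The chi-square test mentioned above is typically computed from a 2x2 contingency table of document counts per (word, category) pair. A minimal sketch, using the standard formula (the function name and count layout are assumptions, not from the patent):

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one (feature word, text category) pair.
    a: category documents containing the word
    b: other documents containing the word
    c: category documents without the word
    d: other documents without the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A degenerate table (empty row or column) carries no information.
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

Words with the highest statistic per category would then be kept as feature words.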
The inventor has found that certain feature words with distinctive characteristics play a key role in classifying the text to be classified. In the existing text classification method, an insufficient number of matching sample feature words can leave such distinctive feature words with low weights, so that the determined text category of the text to be classified is inaccurate.
Disclosure of Invention
The invention provides a text classification method, device, medium and equipment, which are used to improve the accuracy of determining the text category to which a text to be classified belongs.
In a first aspect, an embodiment of the present invention provides a text classification method, including:
aiming at each key feature word in the text to be classified, determining the word category corresponding to the key feature word according to the sample feature word corresponding to the word category in the sample library; and
determining a text category corresponding to the sample text with the key feature words according to the corresponding relation between the text category in the sample library and the sample text, and taking the determined text category as the text category corresponding to the key feature words, wherein the sample text corresponding to each text category comprises a plurality of sample feature words;
determining the weight of the key feature word under each corresponding text category, wherein the weight of the key feature word under any corresponding text category is: the sum of the weights, under that text category, of the sample feature words belonging to the same word category as the key feature word;
and determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
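The four steps above can be sketched in code. This is a minimal illustration under an assumed data layout (dictionary names and shapes are hypothetical, not specified by the patent):

```python
from collections import defaultdict

def classify(key_words, word_category, categories_of_word, category_word_weight):
    """Sketch of the claimed method.
    word_category:        feature word -> word category (step 1)
    categories_of_word:   feature word -> set of text categories whose
                          sample texts contain it (step 2)
    category_word_weight: (text category, word category) -> summed weight of
                          all same-word-category sample feature words (step 3)
    """
    score = defaultdict(float)
    for w in key_words:
        wc = word_category.get(w)
        if wc is None:
            continue  # word absent from the sample library
        for tc in categories_of_word.get(w, ()):
            # step 3: the word's weight under tc is the word-category sum
            score[tc] += category_word_weight.get((tc, wc), 0.0)
    # step 4: the category with the largest aggregate weight wins
    return max(score, key=score.get) if score else None
```

The key point is that `category_word_weight` aggregates over the whole word category, not over the single key feature word.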
Optionally, the text classification method provided in the embodiment of the present invention further includes:
pre-storing word category weights, wherein each word category weight represents the sum of the weights of the sample feature words belonging to the same word category under the same text category;
and determining the weight of the key feature word under any corresponding text category includes:
acquiring, from the stored word category weights, the word category weight corresponding to the word category to which the key feature word belongs;
and taking the acquired word category weight as the weight of the key feature word under that text category.
Optionally, in the method, determining the text category to which the text to be classified belongs according to the weight of each key feature word in each corresponding text category specifically includes:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
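This additive variant scores each candidate category by a sum rather than a product. A minimal sketch (function and argument names are hypothetical):

```python
def score_sum(key_word_weights, nonkey_cond_probs):
    """Additive score of one candidate text category: the sum of the key
    feature words' word-category weights under that category plus the sum of
    the non-key feature words' conditional probabilities under it."""
    return sum(key_word_weights) + sum(nonkey_cond_probs)

def pick_category(scores):
    """scores: text category -> additive score; the highest score wins."""
    return max(scores, key=scores.get)
```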
Optionally, in the method, determining the category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category specifically includes:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
Optionally, in the method, the following formula is adopted to determine the probability that the text to be classified belongs to any text category:

P(y_i \mid x) = \frac{P(y_i) \prod_{j=1}^{m} P(a_j) \prod_{t=1}^{n} P(a_t \mid y_i)}{P(x)}

wherein x denotes the text to be classified; y_i denotes the text category; P(y_i|x) denotes the probability that the text to be classified belongs to text category y_i; P(y_i) denotes the prior probability of text category y_i; a_j denotes the j-th key feature word in the text to be classified; m denotes the number of key feature words in the text to be classified; P(a_j) denotes the weight of the j-th key feature word under text category y_i; P(x) denotes the prior probability of the text to be classified; a_t denotes the t-th non-key feature word in the text to be classified; n denotes the number of non-key feature words in the text to be classified; and P(a_t|y_i) denotes the conditional probability of the t-th non-key feature word under text category y_i.
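The posterior described above (the category prior times the product of key-feature-word weights and non-key conditional probabilities, divided by the text prior) is straightforward to evaluate. A minimal sketch with hypothetical argument names:

```python
from math import prod

def posterior(prior_y, key_word_weights, nonkey_cond_probs, prior_x):
    """P(y_i|x) = P(y_i) * prod_j P(a_j) * prod_t P(a_t|y_i) / P(x),
    where each P(a_j) is the key feature word's summed word-category
    weight under category y_i."""
    return prior_y * prod(key_word_weights) * prod(nonkey_cond_probs) / prior_x
```

Since P(x) is the same for every candidate category, in practice one could compare only the numerators.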
In a second aspect, an embodiment of the present invention provides a text classification apparatus, including:
the first determining module is used for determining the word category corresponding to each key feature word in the text to be classified according to the sample feature word corresponding to the word category in the sample library; and
the second determining module is used for determining a text category corresponding to the sample text with the key feature words according to the corresponding relation between the text category in the sample library and the sample text, and taking the determined text category as the text category corresponding to the key feature words, wherein the sample text corresponding to each text category comprises a plurality of sample feature words;
a third determining module, configured to determine a weight of the key feature word in each corresponding text category, where the weight of the key feature word in any corresponding text category is: the sum of the weights of the sample characteristic words belonging to the same word class as the key characteristic word in any text class;
and the fourth determining module is used for determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
Optionally, the text classification apparatus provided in the embodiment of the present invention further includes:
the storage module is used for storing word class weights in advance, wherein each word class weight is used for representing the sum of the weights of sample characteristic words belonging to the same word class in the same text class;
the third determining module, when determining the weight of the key feature word in any corresponding text category, is specifically configured to:
acquiring the word category weight corresponding to the word category to which the key characteristic word belongs from the stored word category weights;
and taking the acquired word category weight as the weight of the key characteristic word in any text category.
Optionally, in the apparatus, the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
Optionally, in the apparatus, the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
Optionally, in the apparatus, the fourth determining module is specifically configured to determine the probability that the text to be classified belongs to any text category by using the following formula:

P(y_i \mid x) = \frac{P(y_i) \prod_{j=1}^{m} P(a_j) \prod_{t=1}^{n} P(a_t \mid y_i)}{P(x)}

wherein x denotes the text to be classified; y_i denotes the text category; P(y_i|x) denotes the probability that the text to be classified belongs to text category y_i; P(y_i) denotes the prior probability of text category y_i; a_j denotes the j-th key feature word in the text to be classified; m denotes the number of key feature words in the text to be classified; P(a_j) denotes the weight of the j-th key feature word under text category y_i; P(x) denotes the prior probability of the text to be classified; a_t denotes the t-th non-key feature word in the text to be classified; n denotes the number of non-key feature words in the text to be classified; and P(a_t|y_i) denotes the conditional probability of the t-th non-key feature word under text category y_i.
In a third aspect, an embodiment of the present invention provides a non-volatile computer storage medium, where an executable program is stored, and the executable program is executed by a processor to implement the steps of the text classification method according to any one of the above embodiments.
In a fourth aspect, an embodiment of the present invention provides a text classification device, which includes a memory, a processor, and a computer program stored in the memory, where the processor implements the steps of the text classification method according to any one of the above embodiments when executing the program.
By utilizing the text classification method, the text classification device, the text classification medium and the text classification equipment, the following beneficial effects are achieved:
the method comprises the steps of carrying out word category division on sample characteristic words in a sample library in advance, inducing the sample characteristic words with the same characteristics but different expression modes into sample characteristic words of the same word category, carrying out word category division on key characteristic words in a text to be classified, and determining the weight of the key characteristic words under any corresponding text category as the sum of the weights of the sample characteristic words belonging to the same word category as the key characteristic words under the text category, so that the weight of the key characteristic words under the text category is improved, and the accuracy of the text category to which the text to be classified belongs can be improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description, which proceeds with reference to the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 is a schematic flowchart of a text classification method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for determining a weight of each sample feature word in any text category according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for determining a text category to which a text to be classified belongs according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating another method for determining a text category to which a text to be classified belongs according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a text classification apparatus according to a second embodiment of the present invention;
fig. 6 is a schematic diagram of the hardware structure of a text classification device according to embodiment four of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following will further describe possible embodiments of the present invention with reference to the accompanying drawings.
Example one
An embodiment of the present invention provides a text classification method, as shown in fig. 1, including:
step 101, for each key feature word in the text to be classified, determining the word category corresponding to the key feature word according to the sample feature words corresponding to the word categories in the sample library;
step 102, determining the text category corresponding to the sample text containing the key feature word according to the correspondence between text categories and sample texts in the sample library, and taking the determined text category as a text category corresponding to the key feature word;
step 103, determining the weight of the key feature word under each corresponding text category;
step 104, determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
In specific implementation, a plurality of word categories used for representing categories to which the characteristic words belong are divided in advance, sample characteristic words in the sample text are divided into corresponding word categories, and the corresponding relation between the word categories and the sample characteristic words is stored in a sample library. Specifically, a plurality of word categories can be obtained according to the semantic division of the special words, and the special words with the same semantic meaning or similar semantic meaning are further divided into the same word category. A plurality of word categories may also be obtained by dividing according to application scenarios of the feature words or the technical field to which the feature words belong, which is not limited herein.
When the embodiment of the invention is implemented in different application scenes, the correspondingly divided word categories are different, for example, in the application scene of classifying texts corresponding to news, the word categories can be divided into categories of science and technology, entertainment, finance, life and the like, and in the case of classifying medical articles, the word categories can be divided into categories of internal medicine, surgery, gynecology and the like.
Correspondingly, the word classification of the key feature words in the text to be classified is performed, so as to determine the word classification corresponding to the key feature words, and optionally, the word classification corresponding to the key feature words is determined according to the sample feature words corresponding to the word classification in the sample library, which specifically includes: searching sample characteristic words which are the same as the key characteristic words from the corresponding relation between the word category and the sample characteristic words in the sample library; and determining the word category corresponding to the searched sample characteristic word as the word category corresponding to the key characteristic word.
The key feature words in the text to be classified may be extracted as follows: the text to be classified is segmented into words, and the prepositions, conjunctions and stop words among the segmented words are filtered out to obtain the feature words of the text to be classified; key feature words are then selected from these feature words using a chi-square test algorithm. It should be noted that the feature words of the text to be classified comprise key feature words and non-key feature words, where the non-key feature words are the feature words that are not selected as key feature words.
The manner of extracting the sample feature words in the sample text may be: and performing word segmentation on the sample text, and filtering prepositions, conjunctions and stop words in the sample text after word segmentation to obtain characteristic words in the sample text.
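The segmentation-and-filtering step described above can be sketched as follows. Whitespace splitting stands in for a real word segmenter (Chinese text would need a segmentation library such as jieba), and the stop-word list is purely illustrative:

```python
# Illustrative stop/preposition/conjunction list; a real system would use
# a full stop-word lexicon for the target language.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def extract_feature_words(text):
    """Segment the text and drop prepositions, conjunctions and stop words,
    leaving the feature words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]
```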
Optionally, the key feature words in the text to be classified are the entity words in the text to be classified, where entity words are words with concrete meaning such as nouns, technical terms, person names and place names. Of course, all the feature words in the text to be classified may also be used as key feature words, in which case all the feature words in the sample texts are used as sample feature words.
It should be noted that the sample feature words corresponding to the same word category are different from each other, that is, there are not two completely identical sample feature words in the same word category. The sample feature words in the same word category may be sample feature words in sample texts belonging to different text categories, and may also be sample feature words in sample texts belonging to the same text category.
In specific implementation, a plurality of text categories used for representing the categories to which the texts belong are divided in advance, the sample texts are divided into corresponding text categories, and the corresponding relation between the text categories and the sample texts is stored in a sample library. Specifically, a plurality of text categories can be obtained by dividing according to the content of the sample text, and the sample texts belonging to the same field or similar fields are further divided into the same text category.
When the embodiment of the invention is implemented in different application scenes, the texts to be classified and the sample texts stored in the sample library are the texts in the application scenes, for example, when the embodiment is applied to a scene for classifying news, the texts corresponding to a first number of news can be used as the sample texts of the first number, and the text categories obtained by dividing the application scenes can be categories such as science and technology, entertainment, finance, life, sports and the like, namely, some news belong to science and technology categories, and some news belong to finance categories; the method is applied to a scene of classifying medical science, texts corresponding to a first number of medical science can be used as a first number of sample texts, and the text categories obtained by dividing under the application scene can be medical, surgical, gynecological and other categories.
In this step, sample feature words that are the same as the key feature words are searched for from each sample text, and the text category corresponding to the searched sample text is used as the text category corresponding to the key feature words, that is, the text category corresponding to the sample text having the sample feature words that are the same as the key feature words is used as the text category corresponding to the key feature words. When the sample feature word which is the same as the key feature word is found in the sample texts belonging to different text categories, the key feature word corresponds to a plurality of text categories, and when the sample feature word which is the same as the key feature word is found in the sample text belonging to one text category, the key feature word corresponds to one text category.
It should be noted that any two texts corresponding to the same text category may share feature words, while the feature words within a single text are distinct from one another. For example, text A contains feature words A1, A2 and A3, text B contains feature words A1, B2 and B3, and both texts correspond to text category 1: A and B share the feature word A1, but within each text the feature words are all different. The feature words of a single text may also belong to different word categories; for example, feature word A1 of text A belongs to word category 1 while A2 and A3 belong to word category 2. When the text is a sample text, its feature words are sample feature words; when it is the text to be classified, they are the feature words of the text to be classified.
In specific implementation, determining the weight of the key feature word under each corresponding text category includes:
for each text category corresponding to the key feature word, determining, among the sample feature words of the sample texts corresponding to that text category, the sample feature words belonging to the same word category as the key feature word, to obtain the feature word set corresponding to that word category under that text category, where the sample feature words in any such set belong to the same word category and all occur in the sample texts corresponding to that text category; and, for each feature word set, calculating the sum of the weights, under that text category, of all sample feature words in the set, and taking the sum as the weight of the key feature word under that text category, thereby obtaining the weight of the key feature word under each corresponding text category.
Or the sum value corresponding to each word category under each text category may be pre-calculated and stored, and then the weight of the key feature word under each corresponding text category is determined, including: and inquiring the sum value corresponding to the word class of the key characteristic word under the stored text class aiming at each text class corresponding to the key characteristic word, and taking the extracted sum value as the weight of the key characteristic word under the text class. The sum value corresponding to each word category under each text category is stored in advance, so that the weight of the key characteristic word under each corresponding text category can be determined only through query operation, and the calculation speed can be improved to a certain extent.
For example, assume that text category 1 corresponds to sample text 1 (with sample feature words 11, 12, 13, 14) and sample text 2 (with sample feature words 11, 21, 22, 23, 24), and that text category 2 corresponds to sample text 3 (with sample feature words 31, 32). Suppose sample feature words 11, 12, 21, 22, 23 and 31 all belong to word category 1, and sample feature words 13, 14, 24 and 32 all belong to word category 2. Then, under text category 1, the feature word set corresponding to word category 1 is S1 = {11, 12, 21, 22, 23} and the feature word set corresponding to word category 2 is S2 = {13, 14, 24}. The weights P11, P12, P21, P22 and P23 of the sample feature words in S1 under text category 1 are calculated, and SUM1 = P11 + P12 + P21 + P22 + P23 is taken as the sum value of word category 1 under text category 1; likewise, the weights P13, P14 and P24 of the sample feature words in S2 under text category 1 are calculated, and SUM2 = P13 + P14 + P24 is taken as the sum value of word category 2 under text category 1.
Step 104: determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
In specific implementation, the weight of each key feature word in the text to be classified under each corresponding text category is determined. The sum of the weights of the key feature words belonging to the same text category in the text category can be used as the probability that the text to be classified belongs to the text category, and the text category corresponding to the maximum probability is determined as the text category to which the text to be classified belongs.
It should be noted that, in the embodiments of the present invention, the words belonging to the same word category are equivalent to the words belonging to the same word category, and the words belonging to the same text category are equivalent to the texts belonging to the same text category.
According to the embodiment of the invention, the sample feature words in the sample library are divided into word categories in advance, so that sample feature words with the same characteristics but different expressions can be grouped into the same word category. The same word category division is applied to the key feature words in the text to be classified, and the weight of a key feature word under any corresponding text category is determined as the sum of the weights, under that text category, of the sample feature words belonging to the same word category as the key feature word. This increases the weight of the key feature word under the text category, and therefore improves the accuracy of the determined text category to which the text to be classified belongs.
Optionally, the text classification method provided in the embodiment of the present invention further includes: pre-storing word category weights, where each word category weight represents the sum of the weights, under the same text category, of the sample feature words belonging to the same word category. Determining the weight of the key feature word under any corresponding text category then includes:
acquiring the word class weight corresponding to the word class to which the key characteristic word belongs from the stored word class weights; and taking the acquired word category weight as the weight of the key characteristic word under any corresponding text category.
In specific implementation, the sum of the weights of the sample feature words belonging to the same word category under the same text category is calculated in advance and stored as a word category weight; for example, the sum of the weights of the sample feature words belonging to word category 1 under text category 1 is stored as a word category weight r. When the weight of a key feature word under any corresponding text category is to be determined, the word category weights corresponding to the word category to which the key feature word belongs are looked up among the stored word category weights; several such weights may be found, in which case the one under the text category in question is selected and taken as the weight of the key feature word under that text category.
Optionally, the text classification method provided in the embodiment of the present invention further includes:
the correspondence among word categories, text categories, and word category weights is stored in advance, where a word category weight is uniquely determined by a word category and a text category, and each word category weight represents the sum of the weights, under the same text category, of the sample feature words belonging to the same word category. Determining the weight of the key feature word under any corresponding text category then includes:
and searching, from the pre-stored correspondence among word categories, text categories, and word category weights, for the word category weight corresponding to said any text category and to the word category to which the key feature word belongs, and taking the found word category weight as the weight of the key feature word under said any corresponding text category.
In specific implementation, a word category weight table may be pre-stored, where each entry in the table includes a word category, a text category, and a word category weight; the word category weight in an entry represents the sum of the weights of the sample feature words belonging to the word category in that entry under the text category in that entry, and a word category weight can be uniquely determined from the table given any word category and any text category.
The word category weight table is shown in Table 1:

Table 1

Word category | Text category | Word category weight
---|---|---
Word category 1 | Text category 1 | 13%
Word category 1 | Text category 2 | 14%

In Table 1, word category 1, text category 1, and 13% belong to the same entry, and word category 1, text category 2, and 14% belong to the same entry.
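Assuming the table is held in memory as a mapping keyed by (word category, text category) pairs — one hypothetical way to realize the unique lookup the table describes — retrieval might look like the following; the 13%/14% values are the illustrative entries from Table 1.

```python
# Pre-stored word category weight table keyed by (word_category, text_category),
# so a single lookup uniquely determines the word category weight.
word_category_weights = {
    (1, 1): 0.13,  # word category 1 under text category 1
    (1, 2): 0.14,  # word category 1 under text category 2
}

def key_word_weight(word_cat, text_cat, table=word_category_weights):
    """Weight of a key feature word: the stored weight of its word category
    under the given text category (None when no entry exists)."""
    return table.get((word_cat, text_cat))
```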
Optionally, in the embodiment of the present invention, the weight of each sample feature word in any text category is determined in advance according to the content provided in fig. 2:
Step 202, determining the weight of the sample feature word under any text category according to the times and the total number.
In specific implementation, the number of times the sample feature word appears among the sample feature words contained in the sample text corresponding to any text category is counted in advance, as is the total number of sample feature words in the sample text corresponding to that text category. The ratio of the count to the total is then taken as the weight of the sample feature word under that text category, either directly or converted into a percentage.
Continuing the example given between step 103 and step 104, consider the weight of the sample feature word 11 under text category 1: the sample feature word 11 appears 2 times in the sample text corresponding to text category 1, and the total number of sample feature words in that sample text is 9, so P11 = 2/9, or, expressed as a percentage, P11 ≈ 22.2%.
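The count-over-total weight computation can be sketched as follows, using the nine sample feature words of text category 1 from the running example; the function name is hypothetical.

```python
from collections import Counter

def sample_word_weight(word, words_under_category):
    """Weight of a sample feature word under a text category: its number of
    occurrences divided by the total number of sample feature words (the
    input is a multiset, so repeated words count multiple times)."""
    counts = Counter(words_under_category)
    return counts[word] / len(words_under_category)

# Sample feature words under text category 1 from the running example.
words = ["11", "12", "13", "14", "11", "21", "22", "23", "24"]
p11 = sample_word_weight("11", words)  # 2 occurrences out of 9 words
```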
Optionally, according to the content provided in fig. 3, the text category to which the text to be classified belongs is determined according to the weight of each key feature word under each corresponding text category:
The larger the sum of the weights of the key feature words corresponding to the text category in the text category is, the larger the probability of occurrence of the text to be classified in the sample text corresponding to the text category is, and correspondingly, the larger the sum of the conditional probabilities of the non-key feature words in the text to be classified in the text category is, the larger the probability of occurrence of the text to be classified in the sample text corresponding to the text category is. Specifically, the result obtained by adding the sum of the weights of the key feature words corresponding to the text category in the text category to the sum of the conditional probabilities of the non-key feature words in the text to be classified in the text category can be used as the occurrence probability of the text to be classified in the sample text corresponding to the text category.
The greater the occurrence probability of the text to be classified in any sample text corresponding to the text category, the greater the probability that the text to be classified belongs to any text category.
The specific implementation of step 302 is: determining a text category corresponding to the occurrence probability of a text to be classified, which is not less than a set probability threshold value, in the occurrence probability of sample texts corresponding to each text category as the text category to which the text to be classified belongs; the set probability threshold can be set according to the actual application scene.
For example, the occurrence probability of the text to be classified in the sample text corresponding to text category 1 is P1, that in the sample text corresponding to text category 2 is P2, that in the sample text corresponding to text category 3 is P3, and the preset probability threshold is P4, where P1 is greater than P4 and both P2 and P3 are less than P4; then text category 1 is determined as the text category to which the text to be classified belongs.
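A minimal sketch of the threshold rule of step 302, with hypothetical category labels and probability values:

```python
def categories_meeting_threshold(occurrence_probs, threshold):
    """Text categories whose occurrence probability for the text to be
    classified is not less than the set probability threshold."""
    return [cat for cat, p in occurrence_probs.items() if p >= threshold]

# Usage: only category "1" clears the threshold 0.5.
selected = categories_meeting_threshold({"1": 0.6, "2": 0.2, "3": 0.1}, 0.5)
```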
Alternatively, the specific implementation of step 302 is: determining the text category corresponding to the maximum of the occurrence probabilities of the text to be classified in the sample texts corresponding to the respective text categories as the text category to which the text to be classified belongs.
For example, the probability of occurrence of the text to be classified in the sample text corresponding to the text category 1 is P1, and the probability of occurrence of the text to be classified in the sample text corresponding to the text category 2 is P2, where P1 is greater than P2, and then the text category 1 is determined as the text category to which the text to be classified belongs.
Optionally, according to the content provided in fig. 4, the text category to which the text to be classified belongs is determined according to the weight of each key feature word under each corresponding text category:
The larger the product of the weights of the key feature words corresponding to the text category under the text category is, the larger the occurrence probability of the text to be classified in the sample text corresponding to the text category is, the larger the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category is, and the larger the occurrence probability of the text to be classified in the sample text corresponding to the text category is. Specifically, the result obtained by multiplying the product of the weights of the key feature words corresponding to the text category under the text category by the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category is used as the occurrence probability of the text to be classified in the sample text corresponding to the text category.
Optionally, the following formula is adopted to determine the probability that the text to be classified belongs to any text category:
P(y_i | x) = P(y_i) × ∏_{j=1}^{m} P(a_j) × ∏_{t=1}^{n} P(a_t | y_i) / P(x)

wherein x represents the text to be classified; y_i represents said any text category; P(y_i | x) represents the probability that the text to be classified belongs to text category y_i; P(y_i) represents the prior probability of text category y_i; a_j represents the j-th key feature word in the text to be classified; m represents the number of key feature words in the text to be classified; P(a_j) represents the weight of the j-th key feature word under text category y_i; P(x) represents the prior probability of the text to be classified; a_t represents the t-th non-key feature word in the text to be classified; n represents the number of non-key feature words in the text to be classified; and P(a_t | y_i) represents the conditional probability of the t-th non-key feature word under text category y_i.
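A minimal sketch of this posterior computation, with the P(a_j) weights, the P(a_t | y_i) conditional probabilities, and the priors assumed to be supplied by the caller (all numbers below are hypothetical):

```python
from math import prod

def category_posterior(prior_y, key_word_weights, nonkey_cond_probs, prior_x):
    """P(y_i | x): the prior of the category times the product of the key
    feature word weights (m terms) and the product of the non-key feature
    word conditional probabilities (n terms), divided by the prior of the
    text to be classified."""
    return prior_y * prod(key_word_weights) * prod(nonkey_cond_probs) / prior_x

# Usage with made-up values: m = 2 key words, n = 1 non-key word.
p = category_posterior(0.4, [0.2, 0.3], [0.5], 0.1)
```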
In specific implementation, the specific implementation manner of step 403 is: and determining the text category corresponding to the probability which is not less than the set probability threshold value in the probability that the text to be classified belongs to each text category as the text category to which the text to be classified belongs. For example, the probability that the text to be classified belongs to the text category 1 is Pa, the probability that the text to be classified belongs to the text category 2 is Pb, and the preset probability threshold is Pc, where Pa is greater than Pc and Pb is less than Pc, and then the text category 1 is determined as the text category to which the text to be classified belongs.
Alternatively, the specific implementation manner of step 403 is: determining the text category corresponding to the maximum of the probabilities of the text to be classified belonging to the respective text categories as the text category to which the text to be classified belongs. For example, the probability that the text to be classified belongs to text category 1 is Pa, and the probability that it belongs to text category 2 is Pb, where Pa is greater than Pb; then text category 1 is determined as the text category to which the text to be classified belongs.
Example two
An embodiment of the present invention provides a text classification apparatus, as shown in fig. 5, including:
a first determining module 501, configured to determine, for each key feature word in the text to be classified, a word category corresponding to the key feature word according to a sample feature word corresponding to the word category in the sample library; and
a second determining module 502, configured to determine, according to a correspondence between a text category in the sample library and a sample text, a text category corresponding to the sample text having the key feature word, and use the determined text category as the text category corresponding to the key feature word, where a sample text corresponding to each text category includes a plurality of sample feature words;
a third determining module 503, configured to determine a weight of the key feature word in each corresponding text category, where the weight of the key feature word in any corresponding text category is: the sum of the weights of the sample characteristic words belonging to the same word category as the key characteristic word in any text category;
a fourth determining module 504, configured to determine, according to the weight of each key feature word in each corresponding text category, a text category to which the text to be classified belongs.
Optionally, the text classification apparatus provided in the embodiment of the present invention further includes:
a storage module 505, configured to pre-store word class weights, where each word class weight is used to represent a sum of weights of sample feature words belonging to the same word class in the same text class;
the third determining module, when determining the weight of the key feature word in any corresponding text category, is specifically configured to:
acquiring the word category weight corresponding to the word category to which the key characteristic word belongs from the stored word category weights;
and taking the acquired word category weight as the weight of the key characteristic word in any text category.
Optionally, the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
Optionally, the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
Optionally, the fourth determining module is specifically configured to determine the probability that the text to be classified belongs to any text category by using the following formula:
P(y_i | x) = P(y_i) × ∏_{j=1}^{m} P(a_j) × ∏_{t=1}^{n} P(a_t | y_i) / P(x)

wherein x represents the text to be classified; y_i represents said any text category; P(y_i | x) represents the probability that the text to be classified belongs to text category y_i; P(y_i) represents the prior probability of text category y_i; a_j represents the j-th key feature word in the text to be classified; m represents the number of key feature words in the text to be classified; P(a_j) represents the weight of the j-th key feature word under text category y_i; P(x) represents the prior probability of the text to be classified; a_t represents the t-th non-key feature word in the text to be classified; n represents the number of non-key feature words in the text to be classified; and P(a_t | y_i) represents the conditional probability of the t-th non-key feature word under text category y_i.
EXAMPLE III
The embodiment of the invention provides a nonvolatile computer storage medium, wherein an executable program is stored in the computer storage medium, and the executable program is executed by a processor to realize the steps of any text classification method in the first embodiment.
Example four
An embodiment of the present invention provides a text classification device, configured to execute any text classification method in the first embodiment; fig. 6 is a schematic diagram of the hardware structure of the text classification device in the fourth embodiment of the present invention. The text classification device may be a desktop computer, a portable computer, a smart phone, a tablet computer, and the like. Specifically, the text classification device may comprise a memory 601, a processor 602, and a computer program stored on the memory, the processor implementing the steps of any of the text classification methods of the first embodiment when executing the program. The memory 601 may include, among other things, Read-Only Memory (ROM) and Random Access Memory (RAM), and provides the processor 602 with program instructions and data stored in the memory 601.
Further, the text classification apparatus according to the fourth embodiment of the present invention may further include an input device 603 and an output device 604. The input device 603 may include a keyboard, mouse, touch screen, etc.; the output device 604 may include a Display apparatus, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), and the like. The memory 601, the processor 602, the input device 603, and the output device 604 may be connected by a bus or other means, and are exemplified by being connected by a bus in fig. 6.
By utilizing the text classification method, the text classification device, the text classification medium and the text classification equipment, the following beneficial effects are achieved:
the sample feature words in the sample library are divided into word categories in advance, so that sample feature words with the same characteristics but different expressions can be grouped into the same word category; the same word category division is applied to the key feature words in the text to be classified, and the weight of a key feature word under any corresponding text category is determined as the sum of the weights, under that text category, of the sample feature words belonging to the same word category as the key feature word, thereby increasing the weight of the key feature word under the text category and improving the accuracy of the determined text category to which the text to be classified belongs.
It should be noted that although several modules of the text classification apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module according to embodiments of the invention. Conversely, the features and functions of one module described above may be further divided into embodiments by a plurality of modules.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (12)
1. A method of text classification, comprising:
determining a word category corresponding to each key feature word in the text to be classified according to the sample feature word corresponding to the word category in the sample library; and
determining a text category corresponding to the sample text with the key feature words according to the corresponding relation between the text category in the sample library and the sample text, and taking the determined text category as the text category corresponding to the key feature words, wherein the sample text corresponding to each text category comprises a plurality of sample feature words;
determining the weight of the key characteristic word under each corresponding text category, wherein the weight of the key characteristic word under any corresponding text category is as follows: the sum of the weights of the sample characteristic words belonging to the same word class as the key characteristic word in any text class;
and determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
2. The method of claim 1, further comprising:
pre-storing word class weights, wherein each word class weight is used for representing the sum of weights of sample characteristic words belonging to the same word class in the same text class;
determining the weight of the key feature word under any corresponding text category, including:
acquiring the word category weight corresponding to the word category to which the key characteristic word belongs from the stored word category weights;
and taking the acquired word category weight as the weight of the key characteristic word in any text category.
3. The method according to claim 1, wherein determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category specifically comprises:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
4. The method according to claim 1, wherein determining the category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category specifically comprises:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
5. The method according to claim 4, characterized in that the probability that the text to be classified belongs to any text category is determined using the following formula:
P(y_i | x) = P(y_i) × ∏_{j=1}^{m} P(a_j) × ∏_{t=1}^{n} P(a_t | y_i) / P(x)

wherein x represents the text to be classified; y_i represents said any text category; P(y_i | x) represents the probability that the text to be classified belongs to text category y_i; P(y_i) represents the prior probability of text category y_i; a_j represents the j-th key feature word in the text to be classified; m represents the number of key feature words in the text to be classified; P(a_j) represents the weight of the j-th key feature word under text category y_i; P(x) represents the prior probability of the text to be classified; a_t represents the t-th non-key feature word in the text to be classified; n represents the number of non-key feature words in the text to be classified; and P(a_t | y_i) represents the conditional probability of the t-th non-key feature word under text category y_i.
6. A text classification apparatus, comprising:
the first determining module is used for determining the word category corresponding to each key feature word in the text to be classified according to the sample feature word corresponding to the word category in the sample library; and
the second determining module is used for determining a text category corresponding to the sample text with the key feature words according to the corresponding relation between the text category in the sample library and the sample text, and taking the determined text category as the text category corresponding to the key feature words, wherein the sample text corresponding to each text category comprises a plurality of sample feature words;
a third determining module, configured to determine a weight of the key feature word in each corresponding text category, where the weight of the key feature word in any corresponding text category is: the sum of the weights of the sample characteristic words belonging to the same word class as the key characteristic word in any text class;
and the fourth determining module is used for determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
7. The apparatus of claim 6, further comprising:
the storage module is used for pre-storing word class weights, wherein each word class weight is used for representing the sum of the weights of sample characteristic words belonging to the same word class in the same text class;
the third determining module, when determining the weight of the key feature word in any corresponding text category, is specifically configured to:
acquiring the word class weight corresponding to the word class to which the key characteristic word belongs from the stored word class weights;
and taking the acquired word category weight as the weight of the key characteristic word in any text category.
8. The apparatus of claim 6, wherein the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
9. The apparatus of claim 6, wherein the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
10. The apparatus according to claim 9, wherein the fourth determining module is specifically configured to determine the probability that the text to be classified belongs to any text category by using the following formula:
P(y_i | x) = P(y_i) · ∏_{j=1}^{m} P(a_j) · ∏_{t=1}^{n} P(a_t | y_i) / P(x)
wherein x represents the text to be classified; y_i represents said any text category; P(y_i | x) represents the probability that the text to be classified belongs to text category y_i; P(y_i) represents the prior probability of text category y_i; a_j represents the j-th key feature word in the text to be classified; m represents the number of key feature words in the text to be classified; P(a_j) represents the weight of the j-th key feature word under text category y_i; P(x) represents the prior probability of the text to be classified; a_t represents the t-th non-key feature word in the text to be classified; n represents the number of non-key feature words in the text to be classified; and P(a_t | y_i) represents the conditional probability of the t-th non-key feature word under text category y_i.
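The product-form scoring of claims 9 and 10 can be sketched as follows. This is an illustrative sketch only: the function and dictionary names are assumptions, and the smoothing constant for unseen words (1e-9) does not appear in the patent, which leaves handling of unseen words unspecified.

```python
def classify(text_words, categories, key_weights, cond_probs, priors, p_x=1.0):
    """Weighted naive-Bayes classification per claims 9-10.

    For each category y_i, the score is the prior P(y_i) times the product
    of the key-word weights P(a_j) and the product of the non-key-word
    conditional probabilities P(a_t | y_i), divided by P(x).

    key_weights[cat][w] -- weight P(a_j) of key feature word w under cat
    cond_probs[cat][w]  -- conditional probability P(a_t | cat)
    priors[cat]         -- prior probability P(y_i)
    All names and layouts are illustrative assumptions.
    """
    best_cat, best_score = None, float("-inf")
    for cat in categories:
        score = priors[cat]
        for w in text_words:
            if w in key_weights[cat]:
                # Key feature word: use its precomputed category weight.
                score *= key_weights[cat][w]
            else:
                # Non-key word: use its conditional probability under cat,
                # with a small assumed floor for unseen words.
                score *= cond_probs[cat].get(w, 1e-9)
        score /= p_x  # P(x) is the same for every category
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat
```

Since P(x) is identical across categories, dividing by it does not change the argmax; it is kept here only to mirror the formula of claim 10.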
11. A non-transitory computer storage medium storing an executable program for execution by a processor to perform the steps of the method of any one of claims 1 to 5.
12. A text classification device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the steps of the method of any one of claims 1 to 5 when executing the program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710370523.9A CN108959237B (en) | 2017-05-23 | 2017-05-23 | Text classification method, device, medium and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108959237A CN108959237A (en) | 2018-12-07 |
CN108959237B true CN108959237B (en) | 2022-11-22 |
Family
ID=64494216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710370523.9A Active CN108959237B (en) | 2017-05-23 | 2017-05-23 | Text classification method, device, medium and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108959237B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717040A (en) * | 2019-09-18 | 2020-01-21 | 平安科技(深圳)有限公司 | Dictionary expansion method and device, electronic equipment and storage medium |
CN115759072B (en) * | 2022-11-21 | 2024-03-12 | 时趣互动(北京)科技有限公司 | Feature word classification method and device, electronic equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7836061B1 (en) * | 2007-12-29 | 2010-11-16 | Kaspersky Lab, Zao | Method and system for classifying electronic text messages and spam messages |
CN104915356A (en) * | 2014-03-13 | 2015-09-16 | 中国移动通信集团上海有限公司 | Text classification correcting method and device |
Non-Patent Citations (1)
Title |
---|
Short Text Similarity Algorithm Combining Part of Speech and Its Application in Text Classification; Huang Xianying et al.; Telecommunication Engineering (《电讯技术》); 2017-01-28 (No. 01); full text * |
Also Published As
Publication number | Publication date |
---|---|
CN108959237A (en) | 2018-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114416927B (en) | Intelligent question-answering method, device, equipment and storage medium | |
CN108287864B (en) | Interest group dividing method, device, medium and computing equipment | |
CN112101437B (en) | Fine granularity classification model processing method based on image detection and related equipment thereof | |
US11238050B2 (en) | Method and apparatus for determining response for user input data, and medium | |
US20190057084A1 (en) | Method and device for identifying information | |
CN108536739B (en) | Metadata sensitive information field identification method, device, equipment and storage medium | |
CN110390106B (en) | Semantic disambiguation method, device, equipment and storage medium based on two-way association | |
CN112632257A (en) | Question processing method and device based on semantic matching, terminal and storage medium | |
CN112506864B (en) | File retrieval method, device, electronic equipment and readable storage medium | |
CN113609847B (en) | Information extraction method, device, electronic equipment and storage medium | |
CN113722438A (en) | Sentence vector generation method and device based on sentence vector model and computer equipment | |
CN115168537B (en) | Training method and device for semantic retrieval model, electronic equipment and storage medium | |
CN103309984A (en) | Data processing method and device | |
CN116204672A (en) | Image recognition method, image recognition model training method, image recognition device, image recognition model training device, image recognition equipment, image recognition model training equipment and storage medium | |
CN108182200B (en) | Keyword expansion method and device based on semantic similarity | |
CN108959237B (en) | Text classification method, device, medium and equipment | |
EP3992814A2 (en) | Method and apparatus for generating user interest profile, electronic device and storage medium | |
CN114120166A (en) | Video question and answer method and device, electronic equipment and storage medium | |
CN117333889A (en) | Training method and device for document detection model and electronic equipment | |
CN112559713B (en) | Text relevance judging method and device, model, electronic equipment and readable medium | |
US20240095286A1 (en) | Information processing apparatus, classification method, and storage medium | |
CN114970666A (en) | Spoken language processing method and device, electronic equipment and storage medium | |
CN114817476A (en) | Language model training method and device, electronic equipment and storage medium | |
CN113901302A (en) | Data processing method, device, electronic equipment and medium | |
CN113392124B (en) | Structured language-based data query method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||