CN108959237B - Text classification method, device, medium and equipment

Info

Publication number
CN108959237B
Authority
CN
China
Prior art keywords
text
category
word
classified
sample
Legal status
Active
Application number
CN201710370523.9A
Other languages
Chinese (zh)
Other versions
CN108959237A (en)
Inventor
李探
温旭
张智敏
常卓
王树伟
花少勇
张伟
闫清岭
Current Assignee
Tencent Technology Beijing Co Ltd
Original Assignee
Tencent Technology Beijing Co Ltd
Priority date
Application filed by Tencent Technology Beijing Co Ltd
Priority to CN201710370523.9A
Publication of CN108959237A
Application granted
Publication of CN108959237B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text classification method, device, medium and equipment, wherein the method comprises the following steps: for each key feature word in the text to be classified, determining the word category corresponding to the key feature word according to the sample feature words corresponding to the word categories in a sample library; determining the text category corresponding to the sample texts containing the key feature word according to the correspondence between text categories and sample texts in the sample library, and taking the determined text category as a text category corresponding to the key feature word; determining the weight of the key feature word under each corresponding text category, where the weight of the key feature word under any corresponding text category is the sum of the weights, under that text category, of the sample feature words belonging to the same word category as the key feature word; and determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category. The method and device can improve the accuracy of determining the category of the text to be classified.

Description

Text classification method, device, medium and equipment
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a text classification method, apparatus, medium, and device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
The current commonly used text classification method is as follows:
extracting feature words from the text to be classified by using a chi-square test algorithm; for each feature word, looking up the text category corresponding to that feature word from the correspondence between text categories and sample feature words in a sample library; taking the occurrence probability of the feature word within a text category as the weight of the feature word under that category; and determining the category of the text to be classified according to the weight of each feature word under its corresponding text categories, where the larger the weight of a feature word under a text category, the more likely the text to be classified belongs to that category.
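For illustration only (this sketch is not part of the patent text), the conventional scheme can be approximated in Python as follows; the feature words and category sample data are hypothetical, and chi-square feature selection is assumed to have already been applied:

```python
from collections import Counter

def baseline_scores(feature_words, category_samples):
    """Conventional scheme: a feature word's weight under a text category is
    its occurrence probability among that category's sample feature words;
    a category's score is the sum of those weights."""
    scores = {}
    for category, sample_words in category_samples.items():
        counts = Counter(sample_words)
        total = len(sample_words)
        scores[category] = sum(counts[w] / total for w in feature_words)
    return scores

# Hypothetical inputs: feature words extracted from the text to be classified
# and each category's pooled sample feature words.
scores = baseline_scores(
    ["stock", "fund"],
    {"finance": ["stock", "fund", "bond"], "sports": ["match", "goal", "team"]},
)
print(max(scores, key=scores.get))  # -> finance
```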
The inventor found that certain feature words with distinctive characteristics in the text to be classified play a key role in its classification, yet the existing text classification method may assign such feature words a low weight when the number of matching sample feature words is insufficient, which in turn makes the determined text category of the text inaccurate.
Disclosure of Invention
The invention provides a text classification method, device, medium and equipment, which are used to improve the accuracy of determining the text category to which a text to be classified belongs.
In a first aspect, an embodiment of the present invention provides a text classification method, including:
aiming at each key feature word in the text to be classified, determining the word category corresponding to the key feature word according to the sample feature word corresponding to the word category in the sample library; and
determining a text category corresponding to the sample text with the key feature words according to the corresponding relation between the text category in the sample library and the sample text, and taking the determined text category as the text category corresponding to the key feature words, wherein the sample text corresponding to each text category comprises a plurality of sample feature words;
determining the weight of the key characteristic word under each corresponding text category, wherein the weight of the key characteristic word under any corresponding text category is as follows: the sum of the weights of the sample characteristic words belonging to the same word class as the key characteristic word in any text class;
and determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
Optionally, the text classification method provided in the embodiment of the present invention further includes:
pre-storing word class weights, wherein each word class weight is used for representing the sum of weights of sample characteristic words belonging to the same word class in the same text class;
determining the weight of the key feature word under any corresponding text category, including:
acquiring the word class weight corresponding to the word class to which the key characteristic word belongs from the stored word class weights;
and taking the acquired word category weight as the weight of the key characteristic word in any text category.
Optionally, in the method, determining the text category to which the text to be classified belongs according to the weight of each key feature word in each corresponding text category specifically includes:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
Optionally, in the method, determining the category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category specifically includes:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
Optionally, in the method, the following formula is adopted to determine the probability that the text to be classified belongs to any text category:
$$P(y_i \mid x) = \frac{P(y_i)\prod_{j=1}^{m} P(a_j)\prod_{t=1}^{n} P(a_t \mid y_i)}{P(x)}$$

wherein $x$ represents the text to be classified; $y_i$ represents said any text category; $P(y_i \mid x)$ represents the probability that the text to be classified belongs to text category $y_i$; $P(y_i)$ represents the prior probability of text category $y_i$; $a_j$ represents the $j$-th key feature word in the text to be classified; $m$ represents the number of key feature words in the text to be classified; $P(a_j)$ represents the weight of the $j$-th key feature word under text category $y_i$; $P(x)$ represents the prior probability of the text to be classified; $a_t$ represents the $t$-th non-key feature word in the text to be classified; $n$ represents the number of non-key feature words in the text to be classified; and $P(a_t \mid y_i)$ represents the conditional probability of the $t$-th non-key feature word under text category $y_i$.
In a second aspect, an embodiment of the present invention provides a text classification apparatus, including:
the first determining module is used for determining the word category corresponding to each key feature word in the text to be classified according to the sample feature word corresponding to the word category in the sample library; and
the second determining module is used for determining a text category corresponding to the sample text with the key feature words according to the corresponding relation between the text category in the sample library and the sample text, and taking the determined text category as the text category corresponding to the key feature words, wherein the sample text corresponding to each text category comprises a plurality of sample feature words;
a third determining module, configured to determine a weight of the key feature word in each corresponding text category, where the weight of the key feature word in any corresponding text category is: the sum of the weights of the sample characteristic words belonging to the same word class as the key characteristic word in any text class;
and the fourth determining module is used for determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
Optionally, the text classification apparatus provided in the embodiment of the present invention further includes:
the storage module is used for storing word class weights in advance, wherein each word class weight is used for representing the sum of the weights of sample characteristic words belonging to the same word class in the same text class;
the third determining module, when determining the weight of the key feature word in any corresponding text category, is specifically configured to:
acquiring the word category weight corresponding to the word category to which the key characteristic word belongs from the stored word category weights;
and taking the acquired word category weight as the weight of the key characteristic word in any text category.
Optionally, in the apparatus, the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
Optionally, in the apparatus, the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
Optionally, in the apparatus, the fourth determining module is specifically configured to determine the probability that the text to be classified belongs to any text category by using the following formula:
$$P(y_i \mid x) = \frac{P(y_i)\prod_{j=1}^{m} P(a_j)\prod_{t=1}^{n} P(a_t \mid y_i)}{P(x)}$$

wherein $x$ represents the text to be classified; $y_i$ represents said any text category; $P(y_i \mid x)$ represents the probability that the text to be classified belongs to text category $y_i$; $P(y_i)$ represents the prior probability of text category $y_i$; $a_j$ represents the $j$-th key feature word in the text to be classified; $m$ represents the number of key feature words in the text to be classified; $P(a_j)$ represents the weight of the $j$-th key feature word under text category $y_i$; $P(x)$ represents the prior probability of the text to be classified; $a_t$ represents the $t$-th non-key feature word in the text to be classified; $n$ represents the number of non-key feature words in the text to be classified; and $P(a_t \mid y_i)$ represents the conditional probability of the $t$-th non-key feature word under text category $y_i$.
In a third aspect, an embodiment of the present invention provides a non-volatile computer storage medium, where an executable program is stored, and the executable program is executed by a processor to implement the steps of the text classification method according to any one of the above embodiments.
In a fourth aspect, an embodiment of the present invention provides a text classification device, which includes a memory, a processor, and a computer program stored in the memory, where the processor implements the steps of the text classification method according to any one of the above embodiments when executing the program.
By utilizing the text classification method, the text classification device, the text classification medium and the text classification equipment, the following beneficial effects are achieved:
the method comprises the steps of carrying out word category division on sample characteristic words in a sample library in advance, inducing the sample characteristic words with the same characteristics but different expression modes into sample characteristic words of the same word category, carrying out word category division on key characteristic words in a text to be classified, and determining the weight of the key characteristic words under any corresponding text category as the sum of the weights of the sample characteristic words belonging to the same word category as the key characteristic words under the text category, so that the weight of the key characteristic words under the text category is improved, and the accuracy of the text category to which the text to be classified belongs can be improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description, which proceeds with reference to the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 is a schematic flowchart of a text classification method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for determining a weight of each sample feature word in any text category according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for determining a text category to which a text to be classified belongs according to an embodiment of the present invention;
fig. 4 is a flowchart illustrating another method for determining a text category to which a text to be classified belongs according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a text classification apparatus according to a second embodiment of the present invention;
fig. 6 is a schematic diagram of a hardware structure of a text classification device according to the fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following will further describe possible embodiments of the present invention with reference to the accompanying drawings.
Example one
An embodiment of the present invention provides a text classification method, as shown in fig. 1, including:
step 101, determining a word class corresponding to each key feature word in the text to be classified according to the sample feature word corresponding to the word class in the sample library.
In specific implementation, a plurality of word categories representing the categories to which feature words belong are divided in advance, the sample feature words in the sample texts are divided into the corresponding word categories, and the correspondence between word categories and sample feature words is stored in the sample library. Specifically, the word categories may be obtained by dividing feature words according to their semantics, with feature words having the same or similar semantics divided into the same word category. The word categories may also be obtained by dividing according to the application scenarios of the feature words or the technical fields to which the feature words belong, which is not limited herein.
When the embodiment of the invention is implemented in different application scenes, the correspondingly divided word categories are different, for example, in the application scene of classifying texts corresponding to news, the word categories can be divided into categories of science and technology, entertainment, finance, life and the like, and in the case of classifying medical articles, the word categories can be divided into categories of internal medicine, surgery, gynecology and the like.
Correspondingly, the word classification of the key feature words in the text to be classified is performed, so as to determine the word classification corresponding to the key feature words, and optionally, the word classification corresponding to the key feature words is determined according to the sample feature words corresponding to the word classification in the sample library, which specifically includes: searching sample characteristic words which are the same as the key characteristic words from the corresponding relation between the word category and the sample characteristic words in the sample library; and determining the word category corresponding to the searched sample characteristic word as the word category corresponding to the key characteristic word.
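As a minimal sketch of this lookup (an illustration, not the patent's implementation), the word-category correspondence can be inverted into a word-to-category index; the word categories and sample feature words below are hypothetical:

```python
def build_word_category_index(word_category_to_samples):
    """Invert the sample library's word-category -> sample-feature-words
    correspondence so each sample feature word maps to its word category."""
    return {word: word_category
            for word_category, words in word_category_to_samples.items()
            for word in words}

# Hypothetical sample library entries.
index = build_word_category_index({
    "word_category_1": ["smartphone", "cellphone", "handset"],
    "word_category_2": ["surgery", "operation"],
})
# A key feature word takes the word category of the identical sample feature word.
print(index.get("cellphone"))  # -> word_category_1
```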
The key feature words in the text to be classified may be extracted as follows: performing word segmentation on the text to be classified, and filtering prepositions, conjunctions and stop words from the segmented text to obtain the feature words of the text to be classified; and selecting key feature words from these feature words by using a chi-square test algorithm. It should be noted that the feature words of the text to be classified consist of key feature words and non-key feature words, where the non-key feature words are those feature words that are not key feature words.
The manner of extracting the sample feature words in the sample text may be: and performing word segmentation on the sample text, and filtering prepositions, conjunctions and stop words in the sample text after word segmentation to obtain characteristic words in the sample text.
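The extraction step could be sketched as follows, assuming the jieba library for Chinese word segmentation and part-of-speech tagging (an assumption; the patent names no library) and a precomputed chi-square score per feature word:

```python
import jieba.posseg as pseg  # assumption: jieba is used for Chinese segmentation

STOP_WORDS = {"的", "了", "是"}  # hypothetical stop-word list

def extract_feature_words(text):
    """Segment the text and drop prepositions (POS flag 'p'), conjunctions
    (POS flag 'c') and stop words; the remaining words are feature words."""
    return [pair.word for pair in pseg.cut(text)
            if pair.flag not in ("p", "c") and pair.word not in STOP_WORDS]

def select_key_feature_words(feature_words, chi2_score, top_k=10):
    """Keep the top-k distinct feature words by a precomputed chi-square
    score; everything else becomes a non-key feature word."""
    ranked = sorted(set(feature_words), key=chi2_score, reverse=True)
    key = set(ranked[:top_k])
    non_key = [w for w in feature_words if w not in key]
    return key, non_key

# Hypothetical usage with a precomputed chi-square score table.
scores = {"手机": 9.1, "上市": 2.3}
key, non_key = select_key_feature_words(
    ["手机", "上市", "手机"], lambda w: scores.get(w, 0.0), top_k=1)
print(key, non_key)  # -> {'手机'} ['上市']
```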
Optionally, the key feature words in the text to be classified are entity words in the text to be classified, where entity words are words with actual meaning such as nouns, technical terms, person names and place names. Of course, all the feature words in the text to be classified may also be used as key feature words, in which case all the feature words in the sample texts are used as sample feature words.
It should be noted that the sample feature words corresponding to the same word category are different from each other, that is, there are not two completely identical sample feature words in the same word category. The sample feature words in the same word category may be sample feature words in sample texts belonging to different text categories, and may also be sample feature words in sample texts belonging to the same text category.
Step 102, determining a text category corresponding to the sample text with the key feature words according to the corresponding relation between the text category in the sample library and the sample text, and taking the determined text category as the text category corresponding to the key feature words, wherein the sample text corresponding to each text category comprises a plurality of sample feature words.
In specific implementation, a plurality of text categories used for representing the categories to which the texts belong are divided in advance, the sample texts are divided into corresponding text categories, and the corresponding relation between the text categories and the sample texts is stored in a sample library. Specifically, a plurality of text categories can be obtained by dividing according to the content of the sample text, and the sample texts belonging to the same field or similar fields are further divided into the same text category.
When the embodiment of the invention is implemented in different application scenarios, the texts to be classified and the sample texts stored in the sample library are texts of the same application scenario. For example, when applied to a news classification scenario, the texts corresponding to a first number of news items can be used as the first number of sample texts, and the text categories divided for this application scenario can be categories such as science and technology, entertainment, finance, life and sports; that is, some news items belong to the science and technology category and some to the finance category. When applied to a scenario of classifying medical articles, the texts corresponding to a first number of medical articles can be used as the first number of sample texts, and the text categories divided for this application scenario can be categories such as internal medicine, surgery and gynecology.
In this step, sample feature words that are the same as the key feature words are searched for from each sample text, and the text category corresponding to the searched sample text is used as the text category corresponding to the key feature words, that is, the text category corresponding to the sample text having the sample feature words that are the same as the key feature words is used as the text category corresponding to the key feature words. When the sample feature word which is the same as the key feature word is found in the sample texts belonging to different text categories, the key feature word corresponds to a plurality of text categories, and when the sample feature word which is the same as the key feature word is found in the sample text belonging to one text category, the key feature word corresponds to one text category.
It should be noted that any two texts corresponding to the same text category may share some feature words, while the feature words within a single text are mutually distinct. For example, if the feature words of text A are A1, A2 and A3, the feature words of text B are A1, B2 and B3, and both texts correspond to text category 1, then texts A and B share the feature word A1, the feature words within text A are mutually distinct, and the feature words within text B are mutually distinct. The feature words of a single text may belong to different word categories; for example, the feature word A1 of text A belongs to word category 1, while A2 and A3 belong to word category 2. When the text is a sample text, its feature words are sample feature words; when the text is the text to be classified, its feature words are the feature words of the text to be classified.
Step 103, determining the weight of the key feature word in each corresponding text category, wherein the weight of the key feature word in any corresponding text category is as follows: and the sum of the weights of the sample characteristic words belonging to the same word class as the key characteristic word in any text class.
In specific implementation, the determining the weight of the key feature word under each corresponding text category includes:
determining, for each text category corresponding to the key feature word, the sample feature words that belong to the same word category as the key feature word among the sample feature words included in the sample texts corresponding to that text category, so as to obtain the feature word set corresponding to that word category under that text category, where the sample feature words in any such feature word set belong to the same word category and are all sample feature words included in the sample texts corresponding to that text category; and calculating, for each such feature word set, the sum of the weights under that text category of all sample feature words in the set, and taking the sum as the weight of the key feature word under the corresponding text category, so as to obtain the weight of the key feature word under each corresponding text category.
Or the sum value corresponding to each word category under each text category may be pre-calculated and stored, and then the weight of the key feature word under each corresponding text category is determined, including: and inquiring the sum value corresponding to the word class of the key characteristic word under the stored text class aiming at each text class corresponding to the key characteristic word, and taking the extracted sum value as the weight of the key characteristic word under the text class. The sum value corresponding to each word category under each text category is stored in advance, so that the weight of the key characteristic word under each corresponding text category can be determined only through query operation, and the calculation speed can be improved to a certain extent.
For example, assume that the sample texts corresponding to text category 1 are sample text 1 and sample text 2, the sample feature words of sample text 1 are 11, 12, 13, 14, the sample feature words of sample text 2 are 11, 21, 22, 23, 24, the sample text corresponding to text category 2 is sample text 3, and the sample feature words of sample text 3 are 31, 32; then the sample texts corresponding to text category 1 include the sample feature words 11, 12, 13, 14, 11, 21, 22, 23, 24, and the sample text corresponding to text category 2 includes the sample feature words 31, 32. Assume further that the sample feature words 11, 12, 21, 22, 23 and 31 belong to the same word category, say word category 1, and the sample feature words 13, 14, 24 and 32 belong to the same word category, say word category 2. Then the feature word set corresponding to word category 1 under text category 1 is S1 = {11, 12, 21, 22, 23}, and the feature word set corresponding to word category 2 is S2 = {13, 14, 24}. The weights P11, P12, P21, P22 and P23 of the sample feature words in set S1 under text category 1 are calculated, SUM1 = P11 + P12 + P21 + P22 + P23 is computed, and SUM1 is taken as the sum value corresponding to word category 1 under text category 1; similarly, the weights P13, P14 and P24 of the feature words in set S2 under text category 1 are calculated, SUM2 = P13 + P14 + P24 is computed, and SUM2 is taken as the sum value corresponding to word category 2 under text category 1.
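A sketch of this computation (illustrative only), using the weights and word-category assignments of the example above as inputs:

```python
def word_category_sums(weights, word_category_of):
    """For one text category, sum the weights of the sample feature words
    sharing a word category; that sum is the weight of any key feature word
    belonging to the same word category."""
    sums = {}
    for word, weight in weights.items():
        category = word_category_of[word]
        sums[category] = sums.get(category, 0.0) + weight
    return sums

# Weights of the sample feature words under text category 1 in the example
# above (occurrence count divided by the total of 9 sample feature words).
weights = {"11": 2/9, "12": 1/9, "21": 1/9, "22": 1/9, "23": 1/9,
           "13": 1/9, "14": 1/9, "24": 1/9}
word_category_of = {"11": 1, "12": 1, "21": 1, "22": 1, "23": 1,
                    "13": 2, "14": 2, "24": 2}
print(word_category_sums(weights, word_category_of))
# -> {1: 0.666..., 2: 0.333...}, i.e. SUM1 = 6/9 and SUM2 = 3/9
```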
And 104, determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
In specific implementation, the weight of each key feature word of the text to be classified under each of its corresponding text categories is determined. For each text category, the sum of the weights, under that category, of the key feature words corresponding to it can be used as the probability that the text to be classified belongs to that category, and the text category with the maximum probability is determined as the text category to which the text to be classified belongs.
It should be noted that, in the embodiments of the present invention, the words belonging to the same word category are equivalent to the words belonging to the same word category, and the words belonging to the same text category are equivalent to the texts belonging to the same text category.
According to the embodiment of the invention, the sample characteristic words in the sample library are subjected to word class division in advance, the sample characteristic words with the same characteristics but different expression modes can be summarized into the sample characteristic words of the same word class, the word class division is carried out on the key characteristic words in the text to be classified, and the weight of the key characteristic words in any corresponding text class is determined as the sum of the weights of the sample characteristic words belonging to the same word class as the key characteristic words in the text class, so that the weight of the key characteristic words in the text class is improved, and the accuracy of the determined text class to which the text to be classified belongs can be improved.
Optionally, the text classification method provided in the embodiment of the present invention further includes: pre-storing word category weights, wherein each word category weight is used for representing the sum of weights of sample characteristic words belonging to the same word category in the same text category, and determining the weight of the key characteristic word in any corresponding text category, and the method comprises the following steps:
acquiring the word class weight corresponding to the word class to which the key characteristic word belongs from the stored word class weights; and taking the acquired word category weight as the weight of the key characteristic word under any corresponding text category.
In specific implementation, the sum of the weights of the sample feature words belonging to the same word category under the same text category is calculated in advance and stored as a word category weight; for example, the sum of the weights of the sample feature words belonging to word category 1 under text category 1 is stored as word category weight r. When determining the weight of a key feature word under any corresponding text category, the word category weights corresponding to the word category to which the key feature word belongs are looked up among the stored word category weights; multiple such word category weights may be found, the one under the text category in question is selected from them, and the selected word category weight is taken as the weight of the key feature word under that text category.
Optionally, the text classification method provided in the embodiment of the present invention further includes:
the method comprises the following steps of preserving the corresponding relation among word categories, text categories and word category weights in advance, wherein only one word category weight is determined according to the word categories and the text categories, each word category weight is used for representing the sum of weights of sample characteristic words belonging to the same word category in the same text category, and then the weight of the key characteristic word in any corresponding text category is determined, and the method comprises the following steps:
and searching any text category and the word category weight corresponding to the word category which is the same as the word category to which the key characteristic word belongs from the corresponding relation among the word categories, the text categories and the word category weights which are stored in advance, and taking the searched word category weight as the weight of the key characteristic word under the corresponding any text category.
In specific implementation, a word class weight table may be pre-stored, the same entry in the table includes a word class, a text class, and a word class weight, the word class weight in the entry is used to represent a sum of weights of sample feature words belonging to the word class in the entry under the text class in the entry, and a word class weight may be uniquely determined according to any word class and any text class in the word class weight table.
The word category weight table is shown as Table 1:

Table 1

    Word category      Text category      Word category weight
    Word category 1    Text category 1    13%
    Word category 1    Text category 2    14%

In Table 1, word category 1, text category 1 and 13% belong to the same entry, and word category 1, text category 2 and 14% belong to the same entry.
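A minimal sketch of such a pre-stored table and lookup (illustrative; the key structure is an assumption), mirroring Table 1:

```python
# Pre-stored correspondence of Table 1: a word category weight is uniquely
# keyed by the pair (word category, text category).
WORD_CATEGORY_WEIGHTS = {
    ("word_category_1", "text_category_1"): 0.13,
    ("word_category_1", "text_category_2"): 0.14,
}

def key_feature_word_weight(word_category, text_category):
    """Look up a key feature word's weight under a text category directly,
    so no summation is needed at classification time."""
    return WORD_CATEGORY_WEIGHTS[(word_category, text_category)]

print(key_feature_word_weight("word_category_1", "text_category_2"))  # -> 0.14
```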
Optionally, in the embodiment of the present invention, the weight of each sample feature word in any text category is determined in advance according to the content provided in fig. 2:
step 201, for each sample feature word, determining the number of times that the sample feature word appears in the sample text corresponding to any text type and the total number of sample feature words in the sample text corresponding to any text type.
Step 202, determining the weight of the sample feature word under any text category according to the times and the total number.
In specific implementation, the number of times the sample feature word appears among the sample feature words included in the sample texts corresponding to the text category is counted in advance, as is the total number of sample feature words in those sample texts; the ratio of the number of times to the total number (optionally converted into a percentage) is then used as the weight of the sample feature word under that text category.
Continuing the example given between step 103 and step 104, take the weight of sample feature word 11 under text category 1: the number of times sample feature word 11 appears in the sample texts corresponding to text category 1 is 2, and the total number of sample feature words in the sample texts corresponding to text category 1 is 9, so

$$P_{11} = \frac{2}{9}$$

or, converted into a percentage,

$$P_{11} = \frac{2}{9} \times 100\% \approx 22.2\%$$
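The same computation as a short sketch (illustrative only), reproducing the 2/9 result:

```python
from fractions import Fraction

def sample_word_weight(word, category_sample_words):
    """Steps 201-202: occurrences of the word among a text category's sample
    feature words, divided by the total number of those sample feature words."""
    return Fraction(category_sample_words.count(word), len(category_sample_words))

# Sample feature words pooled from the sample texts of text category 1.
category_1_words = ["11", "12", "13", "14", "11", "21", "22", "23", "24"]
w = sample_word_weight("11", category_1_words)
print(w, float(w))  # -> 2/9 0.2222...
```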
optionally, according to the content provided in fig. 3, determining the text category to which the text to be classified belongs according to the weight of each key feature word in each corresponding text category:
step 301, for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category in the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified in the text category.
The larger the sum of the weights of the key feature words corresponding to the text category in the text category is, the larger the probability of occurrence of the text to be classified in the sample text corresponding to the text category is, and correspondingly, the larger the sum of the conditional probabilities of the non-key feature words in the text to be classified in the text category is, the larger the probability of occurrence of the text to be classified in the sample text corresponding to the text category is. Specifically, the result obtained by adding the sum of the weights of the key feature words corresponding to the text category in the text category to the sum of the conditional probabilities of the non-key feature words in the text to be classified in the text category can be used as the occurrence probability of the text to be classified in the sample text corresponding to the text category.
Step 302, determining a text category to which a text to be classified belongs according to the occurrence probability of the text to be classified in sample texts corresponding to each text category.
The greater the occurrence probability of the text to be classified in any sample text corresponding to the text category, the greater the probability that the text to be classified belongs to any text category.
The specific implementation of step 302 is: determining a text category corresponding to the occurrence probability of a text to be classified, which is not less than a set probability threshold value, in the occurrence probability of sample texts corresponding to each text category as the text category to which the text to be classified belongs; the set probability threshold can be set according to the actual application scene.
For example, if the occurrence probability of the text to be classified in the sample texts corresponding to text category 1 is P1, in those corresponding to text category 2 is P2, and in those corresponding to text category 3 is P3, and the preset probability threshold is P4, where P1 is greater than P4 and both P2 and P3 are less than P4, then text category 1 is determined as the text category to which the text to be classified belongs.
Alternatively, step 302 may be implemented as follows: among the occurrence probabilities of the text to be classified in the sample texts corresponding to the respective text categories, the text category corresponding to the maximum occurrence probability is determined as the text category to which the text to be classified belongs.
For example, the probability of occurrence of the text to be classified in the sample text corresponding to the text category 1 is P1, and the probability of occurrence of the text to be classified in the sample text corresponding to the text category 2 is P2, where P1 is greater than P2, and then the text category 1 is determined as the text category to which the text to be classified belongs.
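Putting the scheme of fig. 3 together as a sketch (illustrative; the per-category weight lists are hypothetical inputs):

```python
def classify_by_sums(key_weights, non_key_cond_probs, threshold=None):
    """Fig. 3 scheme: per text category, add the sum of the key feature words'
    weights to the sum of the non-key feature words' conditional probabilities,
    then keep the categories above the threshold or take the single best one."""
    scores = {category: sum(key_weights[category])
              + sum(non_key_cond_probs.get(category, []))
              for category in key_weights}
    if threshold is not None:
        return [c for c, s in scores.items() if s >= threshold]
    return max(scores, key=scores.get)

# Hypothetical per-category weights for one text to be classified.
print(classify_by_sums({"tech": [0.4, 0.3], "sports": [0.1]},
                       {"tech": [0.05], "sports": [0.02]}))  # -> tech
```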
Optionally, according to the content provided in fig. 4, determining the text category to which the text to be classified belongs according to the weight of each key feature word in each corresponding text category:
step 401, determining, for a text category corresponding to each key feature word, an occurrence probability of the text to be classified in a sample text corresponding to the text category according to a product of weights of the key feature words corresponding to the text category under the text category and a product of conditional probabilities of non-key feature words in the text to be classified under the text category.
The larger the product of the weights of the key feature words corresponding to the text category under the text category is, the larger the occurrence probability of the text to be classified in the sample text corresponding to the text category is, the larger the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category is, and the larger the occurrence probability of the text to be classified in the sample text corresponding to the text category is. Specifically, the result obtained by multiplying the product of the weights of the key feature words corresponding to the text category under the text category by the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category is used as the occurrence probability of the text to be classified in the sample text corresponding to the text category.
Step 402, determining the probability that the text to be classified belongs to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified.
Optionally, the following formula is adopted to determine the probability that the text to be classified belongs to any text category:
$$P(y_i \mid x) = \frac{P(y_i)\prod_{j=1}^{m} P(a_j)\prod_{t=1}^{n} P(a_t \mid y_i)}{P(x)}$$

wherein $x$ represents the text to be classified; $y_i$ represents said any text category; $P(y_i \mid x)$ represents the probability that the text to be classified belongs to text category $y_i$; $P(y_i)$ represents the prior probability of text category $y_i$; $a_j$ represents the $j$-th key feature word in the text to be classified; $m$ represents the number of key feature words in the text to be classified; $P(a_j)$ represents the weight of the $j$-th key feature word under text category $y_i$; $P(x)$ represents the prior probability of the text to be classified; $a_t$ represents the $t$-th non-key feature word in the text to be classified; $n$ represents the number of non-key feature words in the text to be classified; and $P(a_t \mid y_i)$ represents the conditional probability of the $t$-th non-key feature word under text category $y_i$.
Step 403, determining the text category to which the text to be classified belongs according to the probability that the text to be classified belongs to each text category.
In specific implementation, the specific implementation manner of step 403 is: and determining the text category corresponding to the probability which is not less than the set probability threshold value in the probability that the text to be classified belongs to each text category as the text category to which the text to be classified belongs. For example, the probability that the text to be classified belongs to the text category 1 is Pa, the probability that the text to be classified belongs to the text category 2 is Pb, and the preset probability threshold is Pc, where Pa is greater than Pc and Pb is less than Pc, and then the text category 1 is determined as the text category to which the text to be classified belongs.
Alternatively, step 403 may be implemented as follows: among the probabilities that the text to be classified belongs to the respective text categories, the text category corresponding to the maximum probability is determined as the text category to which the text to be classified belongs. For example, if the probability that the text to be classified belongs to text category 1 is Pa and the probability that it belongs to text category 2 is Pb, where Pa is greater than Pb, then text category 1 is determined as the text category to which the text to be classified belongs.
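The scheme of fig. 4 as a sketch (illustrative; the inputs are hypothetical, and since P(x) is the same for every category it does not affect which category wins):

```python
def classify_by_products(key_weights, non_key_cond_probs, priors, p_x=1.0):
    """Fig. 4 scheme: P(y_i|x) = P(y_i) * prod_j P(a_j) * prod_t P(a_t|y_i) / P(x),
    where P(a_j) is a key feature word's word-category-summed weight and
    P(a_t|y_i) a non-key feature word's conditional probability."""
    posteriors = {}
    for category, prior in priors.items():
        p = prior
        for weight in key_weights.get(category, []):
            p *= weight
        for cond_prob in non_key_cond_probs.get(category, []):
            p *= cond_prob
        posteriors[category] = p / p_x
    return max(posteriors, key=posteriors.get)

# Hypothetical inputs for two candidate text categories.
print(classify_by_products({"tech": [0.4, 0.3], "sports": [0.1, 0.1]},
                           {"tech": [0.05], "sports": [0.02]},
                           {"tech": 0.5, "sports": 0.5}))  # -> tech
```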
Example two
An embodiment of the present invention provides a text classification apparatus, as shown in fig. 5, including:
a first determining module 501, configured to determine, for each key feature word in the text to be classified, a word category corresponding to the key feature word according to a sample feature word corresponding to the word category in the sample library; and
a second determining module 502, configured to determine, according to a correspondence between a text category in the sample library and a sample text, a text category corresponding to the sample text having the key feature word, and use the determined text category as the text category corresponding to the key feature word, where a sample text corresponding to each text category includes a plurality of sample feature words;
a third determining module 503, configured to determine a weight of the key feature word in each corresponding text category, where the weight of the key feature word in any corresponding text category is: the sum of the weights of the sample characteristic words belonging to the same word category as the key characteristic word in any text category;
a fourth determining module 504, configured to determine, according to the weight of each key feature word in each corresponding text category, a text category to which the text to be classified belongs.
Optionally, the text classification apparatus provided in the embodiment of the present invention further includes:
a storage module 505, configured to pre-store word class weights, where each word class weight is used to represent a sum of weights of sample feature words belonging to the same word class in the same text class;
the third determining module, when determining the weight of the key feature word in any corresponding text category, is specifically configured to:
acquiring the word category weight corresponding to the word category to which the key characteristic word belongs from the stored word category weights;
and taking the acquired word category weight as the weight of the key characteristic word in any text category.
Optionally, the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
Optionally, the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
Optionally, the fourth determining module is specifically configured to determine the probability that the text to be classified belongs to any text category by using the following formula:
$$P(y_i \mid x) = \frac{P(y_i)\prod_{j=1}^{m} P(a_j)\prod_{t=1}^{n} P(a_t \mid y_i)}{P(x)}$$

wherein $x$ represents the text to be classified; $y_i$ represents said any text category; $P(y_i \mid x)$ represents the probability that the text to be classified belongs to text category $y_i$; $P(y_i)$ represents the prior probability of text category $y_i$; $a_j$ represents the $j$-th key feature word in the text to be classified; $m$ represents the number of key feature words in the text to be classified; $P(a_j)$ represents the weight of the $j$-th key feature word under text category $y_i$; $P(x)$ represents the prior probability of the text to be classified; $a_t$ represents the $t$-th non-key feature word in the text to be classified; $n$ represents the number of non-key feature words in the text to be classified; and $P(a_t \mid y_i)$ represents the conditional probability of the $t$-th non-key feature word under text category $y_i$.
EXAMPLE III
The embodiment of the invention provides a nonvolatile computer storage medium, wherein an executable program is stored in the computer storage medium, and the executable program is executed by a processor to realize the steps of any text classification method in the first embodiment.
Example four
An embodiment of the present invention provides a text classification device, configured to execute any text classification method in the first embodiment, as shown in fig. 6, which is a schematic diagram of a hardware structure of the text classification device in the fourth embodiment of the present invention, where the text classification device may be a desktop computer, a portable computer, a smart phone, a tablet computer, and the like. Specifically, the text classification device may comprise a memory 601, a processor 602 and a computer program stored on the memory, the processor implementing the steps of any of the text classification methods of the first embodiment when executing the program. The memory 601 may include, among other things, Read-Only Memory (ROM) and Random Access Memory (RAM), and provides the processor 602 with program instructions and data stored in the memory 601.
Further, the text classification apparatus according to the fourth embodiment of the present invention may further include an input device 603 and an output device 604. The input device 603 may include a keyboard, mouse, touch screen, etc.; the output device 604 may include a Display apparatus, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), and the like. The memory 601, the processor 602, the input device 603, and the output device 604 may be connected by a bus or other means, and are exemplified by being connected by a bus in fig. 6.
The processor 602 invokes the program instructions stored in the memory 601 and performs the text classification method provided in the first embodiment according to the obtained program instructions.
By utilizing the text classification method, the text classification device, the text classification medium and the text classification equipment, the following beneficial effects are achieved:
the method comprises the steps of performing word category division on sample characteristic words in a sample library in advance, summarizing the sample characteristic words with the same characteristics but different expression modes into the sample characteristic words of the same word category, performing word category division on key characteristic words in a text to be classified, and determining the weight of the key characteristic words in any corresponding text category as the sum of the weights of the sample characteristic words belonging to the same word category as the key characteristic words in the text category, so that the weight of the key characteristic words in the text category is improved, and the accuracy of the text category to which the text to be classified belongs can be improved.
It should be noted that although several modules of the text classification apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the modules described above may be embodied in one module according to embodiments of the invention. Conversely, the features and functions of one module described above may be further divided into embodiments by a plurality of modules.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (12)

1. A method of text classification, comprising:
determining a word category corresponding to each key feature word in the text to be classified according to the sample feature word corresponding to the word category in the sample library; and
determining a text category corresponding to the sample text with the key feature words according to the corresponding relation between the text category in the sample library and the sample text, and taking the determined text category as the text category corresponding to the key feature words, wherein the sample text corresponding to each text category comprises a plurality of sample feature words;
determining the weight of the key feature word under each corresponding text category, wherein the weight of the key feature word under any corresponding text category is: the sum of the weights, under said any text category, of the sample feature words belonging to the same word category as the key feature word;
and determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
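Read as a pipeline, claim 1 chains four determinations. The following compact Python sketch is illustrative and not part of the claims; word_cat_of, text_cats_of, weight_of, and score_fn are hypothetical stand-ins for the sample-library lookups and the final scoring step:

def classify(key_feature_words, word_cat_of, text_cats_of, weight_of, score_fn):
    # Chain the four determining steps of the method: word category,
    # candidate text categories, per-category weight, final text category.
    weights_per_category = {}
    for kw in key_feature_words:
        wc = word_cat_of(kw)              # step 1: word category of the key feature word
        for tc in text_cats_of(kw):       # step 2: text categories whose sample texts contain kw
            w = weight_of(wc, tc)         # step 3: summed same-word-category weight under tc
            weights_per_category.setdefault(tc, []).append(w)
    return score_fn(weights_per_category)  # step 4: pick the text category from the weights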
2. The method of claim 1, further comprising:
pre-storing word category weights, wherein each word category weight is used for representing the sum of the weights of sample feature words belonging to the same word category under the same text category;
determining the weight of the key feature word under any corresponding text category, including:
acquiring the word category weight corresponding to the word category to which the key feature word belongs from the stored word category weights;
and taking the acquired word category weight as the weight of the key feature word under said any text category.
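In other words, claim 2 precomputes the per-(word category, text category) weight sums offline, so that determining a key feature word's weight at classification time is a dictionary lookup rather than a per-query summation. A minimal sketch, assuming input dictionaries of the same shape as in the earlier toy example (all names hypothetical):

from collections import defaultdict

def build_word_category_weights(word_category, sample_word_weight):
    # word_category: word category -> list of sample feature words
    # sample_word_weight: text category -> {sample feature word: weight}
    # Returns: text category -> {word category: summed weight}
    table = defaultdict(dict)
    for text_cat, word_weights in sample_word_weight.items():
        for word_cat, members in word_category.items():
            table[text_cat][word_cat] = sum(word_weights.get(w, 0.0) for w in members)
    return dict(table)

# Built once offline; at classification time the weight of a key feature
# word under a text category is a constant-time table[text_cat][word_cat] lookup.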
3. The method according to claim 1, wherein determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category specifically comprises:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
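Claim 3 scores by sums rather than products. Under one plausible reading of the claim language (illustrative only; not asserted to be the exact claimed computation), the occurrence score of the text to be classified under a text category could be sketched as:

def occurrence_score_sum(key_word_weights, nonkey_cond_probs):
    # One plausible reading of claim 3: combine the sum of the key feature
    # word weights under the text category with the sum of the non-key
    # feature words' conditional probabilities under that category.
    return sum(key_word_weights) + sum(nonkey_cond_probs)

The text category with the highest such score would then be selected as the text category to which the text to be classified belongs.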
4. The method according to claim 1, wherein determining the category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category specifically comprises:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
5. The method according to claim 4, characterized in that the probability that the text to be classified belongs to any text category is determined using the following formula:
P(y_i | x) = [ P(y_i) · ∏_{j=1}^{m} P(a_j) · ∏_{t=1}^{n} P(a_t | y_i) ] / P(x)
wherein x represents the text to be classified; y_i represents said any text category; P(y_i | x) represents the probability that the text to be classified belongs to the text category y_i; P(y_i) represents the prior probability of the text category y_i; a_j represents the j-th key feature word in the text to be classified; m represents the number of key feature words in the text to be classified; P(a_j) represents the weight of the j-th key feature word under the text category y_i; P(x) represents the prior probability of the text to be classified; a_t represents the t-th non-key feature word in the text to be classified; n represents the number of non-key feature words in the text to be classified; and P(a_t | y_i) represents the conditional probability of the t-th non-key feature word under the text category y_i.
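The formula of claim 5 transcribes directly into code. The sketch below is illustrative only; the numbers in the usage comment are invented toy values:

def posterior(prior_y, key_word_weights, nonkey_cond_probs, prior_x):
    # P(y_i|x) per claim 5: P(y_i) times the product of the key feature
    # word weights P(a_j) under y_i, times the product of the non-key
    # feature words' conditional probabilities P(a_t|y_i), divided by P(x).
    p = prior_y
    for p_aj in key_word_weights:   # one factor per key feature word
        p *= p_aj
    for p_at in nonkey_cond_probs:  # one factor per non-key feature word
        p *= p_at
    return p / prior_x

# Toy values: P(y_i)=0.3, key word weights 0.9 and 0.5, one non-key word
# with conditional probability 0.2, P(x)=0.05:
# posterior(0.3, [0.9, 0.5], [0.2], 0.05) -> 0.3*0.9*0.5*0.2/0.05 = 0.54

Since P(x) is identical for every candidate text category, an implementation that only needs the highest-probability category may skip the division; multiplying in log space is also advisable to avoid floating-point underflow on long texts.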
6. A text classification apparatus, comprising:
the first determining module is used for determining the word category corresponding to each key feature word in the text to be classified according to the sample feature word corresponding to the word category in the sample library; and
the second determining module is used for determining a text category corresponding to the sample text with the key feature words according to the corresponding relation between the text category in the sample library and the sample text, and taking the determined text category as the text category corresponding to the key feature words, wherein the sample text corresponding to each text category comprises a plurality of sample feature words;
a third determining module, configured to determine the weight of the key feature word under each corresponding text category, wherein the weight of the key feature word under any corresponding text category is: the sum of the weights, under said any text category, of the sample feature words belonging to the same word category as the key feature word;
and the fourth determining module is used for determining the text category to which the text to be classified belongs according to the weight of each key feature word under each corresponding text category.
7. The apparatus of claim 6, further comprising:
the storage module is used for pre-storing word category weights, wherein each word category weight is used for representing the sum of the weights of sample feature words belonging to the same word category under the same text category;
the third determining module, when determining the weight of the key feature word in any corresponding text category, is specifically configured to:
acquiring the word category weight corresponding to the word category to which the key feature word belongs from the stored word category weights;
and taking the acquired word category weight as the weight of the key feature word under said any text category.
8. The apparatus of claim 6, wherein the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the sum of the weights of the key feature words corresponding to the text category under the text category and the sum of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
and determining the text category to which the text to be classified belongs according to the occurrence probability of the text to be classified in the sample text corresponding to the text category.
9. The apparatus of claim 6, wherein the fourth determining module is specifically configured to:
for each text category corresponding to each key feature word, determining the occurrence probability of the text to be classified in the sample text corresponding to the text category according to the product of the weights of the key feature words corresponding to the text category under the text category and the product of the conditional probabilities of the non-key feature words in the text to be classified under the text category;
determining the probability of the text to be classified belonging to the text category according to the occurrence probability of the text to be classified in the sample text corresponding to the text category, the prior probability of the text category and the prior probability of the text to be classified;
and determining the text category to which the text to be classified belongs according to the probability of the text to be classified belonging to each text category.
10. The apparatus according to claim 9, wherein the fourth determining module is specifically configured to determine the probability that the text to be classified belongs to any text category by using the following formula:
P(y_i | x) = [ P(y_i) · ∏_{j=1}^{m} P(a_j) · ∏_{t=1}^{n} P(a_t | y_i) ] / P(x)
wherein x represents the text to be classified; y_i represents said any text category; P(y_i | x) represents the probability that the text to be classified belongs to the text category y_i; P(y_i) represents the prior probability of the text category y_i; a_j represents the j-th key feature word in the text to be classified; m represents the number of key feature words in the text to be classified; P(a_j) represents the weight of the j-th key feature word under the text category y_i; P(x) represents the prior probability of the text to be classified; a_t represents the t-th non-key feature word in the text to be classified; n represents the number of non-key feature words in the text to be classified; and P(a_t | y_i) represents the conditional probability of the t-th non-key feature word under the text category y_i.
11. A non-transitory computer storage medium storing an executable program for execution by a processor to perform the steps of the method of any one of claims 1 to 5.
12. A text classification device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the steps of the method of any one of claims 1 to 5 when executing the program.
CN201710370523.9A 2017-05-23 2017-05-23 Text classification method, device, medium and equipment Active CN108959237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710370523.9A CN108959237B (en) 2017-05-23 2017-05-23 Text classification method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN108959237A (en) 2018-12-07
CN108959237B (en) 2022-11-22

Family

ID=64494216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710370523.9A Active CN108959237B (en) 2017-05-23 2017-05-23 Text classification method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN108959237B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110717040A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Dictionary expansion method and device, electronic equipment and storage medium
CN115759072B (en) * 2022-11-21 2024-03-12 时趣互动(北京)科技有限公司 Feature word classification method and device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7836061B1 (en) * 2007-12-29 2010-11-16 Kaspersky Lab, Zao Method and system for classifying electronic text messages and spam messages
CN104915356A (en) * 2014-03-13 2015-09-16 中国移动通信集团上海有限公司 Text classification correcting method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Short text similarity algorithm combining part of speech and its application in text classification (结合词性的短文本相似度算法及其在文本分类中的应用); Huang Xianying (黄贤英) et al.; 《电讯技术》 (Telecommunication Engineering); 2017-01-28 (No. 01); full text *

Also Published As

Publication number Publication date
CN108959237A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
CN114416927B (en) Intelligent question-answering method, device, equipment and storage medium
CN108287864B (en) Interest group dividing method, device, medium and computing equipment
CN112101437B (en) Fine granularity classification model processing method based on image detection and related equipment thereof
US11238050B2 (en) Method and apparatus for determining response for user input data, and medium
US20190057084A1 (en) Method and device for identifying information
CN108536739B (en) Metadata sensitive information field identification method, device, equipment and storage medium
CN110390106B (en) Semantic disambiguation method, device, equipment and storage medium based on two-way association
CN112632257A (en) Question processing method and device based on semantic matching, terminal and storage medium
CN112506864B (en) File retrieval method, device, electronic equipment and readable storage medium
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
CN113722438A (en) Sentence vector generation method and device based on sentence vector model and computer equipment
CN115168537B (en) Training method and device for semantic retrieval model, electronic equipment and storage medium
CN103309984A (en) Data processing method and device
CN116204672A (en) Image recognition method, image recognition model training method, image recognition device, image recognition model training device, image recognition equipment, image recognition model training equipment and storage medium
CN108182200B (en) Keyword expansion method and device based on semantic similarity
CN108959237B (en) Text classification method, device, medium and equipment
EP3992814A2 (en) Method and apparatus for generating user interest profile, electronic device and storage medium
CN114120166A (en) Video question and answer method and device, electronic equipment and storage medium
CN117333889A (en) Training method and device for document detection model and electronic equipment
CN112559713B (en) Text relevance judging method and device, model, electronic equipment and readable medium
US20240095286A1 (en) Information processing apparatus, classification method, and storage medium
CN114970666A (en) Spoken language processing method and device, electronic equipment and storage medium
CN114817476A (en) Language model training method and device, electronic equipment and storage medium
CN113901302A (en) Data processing method, device, electronic equipment and medium
CN113392124B (en) Structured language-based data query method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant