CN115587163A

CN115587163A - Text classification method and device, electronic equipment and storage medium

Info

Publication number: CN115587163A
Application number: CN202211214632.9A
Authority: CN
Inventors: 简仁贤; 刘影; 吴文杰
Original assignee: Emotibot Technologies Ltd
Current assignee: Emotibot Technologies Ltd
Priority date: 2022-09-30
Filing date: 2022-09-30
Publication date: 2023-01-10

Abstract

The application provides a text classification method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a text to be classified; extracting a plurality of keywords of the text to be classified; acquiring first classification rules respectively corresponding to a plurality of first classification labels, and a first word list corresponding to each first classification rule; when each first word list corresponding to a first target classification rule includes a target keyword, determining that the text to be classified conforms to the first target classification rule, where the target keyword is one or more of the plurality of keywords; and determining a first target classification label corresponding to the first target classification rule as a classification label of the text to be classified. The technical solution provided by the embodiments of the application achieves automatic classification, and the accuracy of the resulting classification labels is high.

Description

Text classification method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a text classification method and apparatus, an electronic device, and a storage medium.

Background

Text classification determines the category of an input text under a fixed classification label system and is an important part of natural language processing.

In the related art, when a text is classified, a large number of training texts are collected, labels of the training texts are calibrated, the texts with the labels calibrated are input into a text classification model to be trained, the text classification model automatically extracts text features of the training texts through a text classification algorithm, and classification labels corresponding to the training texts are output based on the text features; and then calculating a loss function value of the text classification model to be trained based on the classification label corresponding to the output training text and the label of the calibrated training text, obtaining the trained text classification model when the loss function value is smaller than a preset value, and realizing text classification through the trained text classification model.

However, in practical application, for some classification categories, the number of training texts that can be collected is small, so that the accuracy of the classification labels output by the trained text classification model is low.

Disclosure of Invention

In order to solve the above technical problem, the present application provides a text classification method, a text classification apparatus, an electronic device, and a storage medium.

In a first aspect, an embodiment of the present application provides a text classification method, including:

acquiring a text to be classified;

extracting a plurality of keywords of the text to be classified;

acquiring first classification rules corresponding to the first classification labels respectively and a first word list corresponding to each first classification rule; the first classification rule corresponding to each first classification label and the first word list corresponding to each first classification rule are obtained by analyzing a training text in advance, when each first word list corresponding to one first classification rule comprises a keyword of the training text, the training text conforms to the first classification rule, and the classification label of the training text comprises the first classification label corresponding to the first classification rule;

when each first word list corresponding to a first target classification rule comprises a target keyword, determining that the text to be classified conforms to the first target classification rule; the target keyword is one or more keywords in the plurality of keywords; the first target classification rule is any first classification rule corresponding to any classification label;

and determining a first target classification label corresponding to the first target classification rule as a classification label of the text to be classified.

Optionally, the method further includes:

for any first classification rule in a plurality of first classification rules, when the first classification rule corresponds to a plurality of first word lists, sequencing the first word lists according to the sequence of the priorities of words included in the first word lists from high to low to obtain a plurality of sequenced first word lists; the priority of the words is obtained by analyzing the training text in advance;

sequentially judging whether each first word list comprises a target keyword or not according to the sequence of the sequenced plurality of first word lists;

and when each first word list comprises a target keyword, executing the step of determining that the text to be classified conforms to the first target classification rule.
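The ordered check described in the three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the names `sort_word_lists` and `matches_in_order` are hypothetical, word priorities are assumed to be integers obtained in advance, and a word list is assumed to take the highest priority of any word it contains, since the text does not specify how a word list's priority is derived from its words.

```python
from typing import Dict, List, Set


def sort_word_lists(word_lists: List[Set[str]], priority: Dict[str, int]) -> List[Set[str]]:
    """Order word lists from high to low priority.

    Assumption: a word list's priority is the highest priority of any word in it;
    words without a known priority count as 0.
    """
    return sorted(
        word_lists,
        key=lambda wl: max((priority.get(w, 0) for w in wl), default=0),
        reverse=True,
    )


def matches_in_order(word_lists: List[Set[str]], keywords: Set[str],
                     priority: Dict[str, int]) -> bool:
    """Check the sorted word lists one by one and stop at the first miss."""
    for word_list in sort_word_lists(word_lists, priority):
        if not (word_list & keywords):
            return False  # this word list contains no keyword of the text
    return True  # every word list contains at least one keyword
```

Checking the high-priority word lists first lets a non-matching rule be rejected early, which may be one motivation for the ordering.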

Optionally, the determining that the text to be classified conforms to the first target classification rule includes:

determining the position of each target keyword in the text to be classified;

calculating the distance between every two adjacent target keywords based on the positions of every two adjacent target keywords in the text to be classified; the distance between every two adjacent target keywords is used for representing the semantic association degree between the corresponding two adjacent target keywords, and the distance is in inverse proportion to the semantic association degree;

and when the distance between every two adjacent target keywords is smaller than a preset distance, determining that the text to be classified conforms to the first target classification rule.
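The distance check above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, positions are assumed to be character offsets of the first occurrence of each target keyword, and the distance between two adjacent keywords is assumed to be the difference of their offsets. The text does not fix any of these choices.

```python
from typing import List, Optional


def adjacent_distances(text: str, target_keywords: List[str]) -> Optional[List[int]]:
    """Return the distances between adjacent target keywords, or None if one is absent."""
    positions = []
    for keyword in target_keywords:
        index = text.find(keyword)  # assumption: first occurrence, character offset
        if index < 0:
            return None
        positions.append(index)
    positions.sort()  # adjacency is taken in order of appearance in the text
    return [b - a for a, b in zip(positions, positions[1:])]


def within_preset_distance(text: str, target_keywords: List[str],
                           preset_distance: int) -> bool:
    """True when every pair of adjacent target keywords is closer than the preset distance."""
    distances = adjacent_distances(text, target_keywords)
    return distances is not None and all(d < preset_distance for d in distances)
```

A smaller distance is treated as a stronger semantic association, so a rule is only accepted when its matched keywords sit close together in the text.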

Optionally, the method further includes:

acquiring second classification rules corresponding to the plurality of second classification labels respectively and a second word list corresponding to each second classification rule; the second classification rule corresponding to each second classification label and the second word list corresponding to each second classification rule are obtained by analyzing a target text in advance, the target text comprises a training text and/or a test text, when each second word list corresponding to one second classification rule comprises a keyword of the target text, the target text conforms to the second classification rule, and the classification label of the target text does not comprise the second classification label corresponding to the second classification rule;

determining whether the classification label of the text to be classified comprises a second classification label;

when the classification label of the text to be classified comprises a second classification label, determining whether the text to be classified conforms to a second classification rule corresponding to the second classification label;

when each second word list corresponding to the second classification rule comprises a target keyword, determining that the text to be classified conforms to the second classification rule;

and deleting the second classification label included in the classification label of the text to be classified.
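The label-removal steps above can be sketched as follows. This is a minimal sketch under assumptions: `remove_excluded_labels` and the shape of `second_rules` (a mapping from a second classification label to a list of rules, each rule being a list of word lists) are hypothetical choices made for illustration.

```python
from typing import Dict, List, Set


def remove_excluded_labels(labels: Set[str], keywords: Set[str],
                           second_rules: Dict[str, List[List[Set[str]]]]) -> Set[str]:
    """Drop every label for which some second classification rule is satisfied."""
    kept = set()
    for label in labels:
        rules = second_rules.get(label, [])
        # A second rule is satisfied when each of its word lists contains a keyword.
        excluded = any(all(word_list & keywords for word_list in rule) for rule in rules)
        if not excluded:
            kept.add(label)
    return kept
```

For instance, with a hypothetical second rule `[{"not"}, {"bad"}]` attached to a label `"complaint"`, a text whose keywords include both "not" and "bad" would have the `"complaint"` label removed.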

Optionally, the method further includes:

before extracting the key words in the text to be classified, inputting the text to be classified into a text classification model obtained by pre-training;

when the target conditions are met, executing the step of extracting the keywords in the text to be classified; the target condition is any one of the following conditions: the text classification model does not output classification labels, the accuracy of the classification labels output from the text classification model is lower than a preset accuracy, and the recall rate of the classification labels output from the text classification model is lower than a preset recall rate.

In a second aspect, an embodiment of the present application provides a text classification apparatus, including:

the text acquisition module is used for acquiring texts to be classified;

the keyword extraction module is used for extracting a plurality of keywords of the text to be classified;

the first information acquisition module is used for acquiring first classification rules corresponding to the plurality of first classification labels respectively and a first word list corresponding to each first classification rule; the first classification rule corresponding to each first classification label and the first word list corresponding to each first classification rule are obtained by analyzing a training text in advance, when each first word list corresponding to one first classification rule comprises a keyword of the training text, the training text conforms to the first classification rule, and the classification label of the training text comprises the first classification label corresponding to the first classification rule;

the first classification rule determining module is used for determining that the text to be classified conforms to a first target classification rule when each first word list corresponding to the first target classification rule comprises a target keyword; the target keyword is one or more keywords in the plurality of keywords; the first target classification rule is a first classification rule corresponding to any classification label;

and the classification label determining module is used for determining a first target classification label corresponding to the first target classification rule as a classification label of the text to be classified.

Optionally, the apparatus further includes:

the word list ordering module is used for ordering the first word lists according to the sequence of the priority of words included in the first word lists from high to low to obtain a plurality of ordered first word lists for any first classification rule in the first classification rules; the priority of the words is obtained by analyzing the training text in advance;

the keyword judgment module is used for sequentially judging whether each first word list comprises a target keyword according to the sequence of the sequenced first word lists;

and when each first word list comprises the target keyword, triggering the classification label determining module to execute the step of determining that the text to be classified conforms to the first target classification rule.

Optionally, the classification label determining module is specifically configured to:

determining the position of each target keyword in the text to be classified;

Optionally, the apparatus further includes:

the second information acquisition module is used for acquiring second classification rules corresponding to the plurality of second classification labels respectively and a second word list corresponding to each second classification rule; the second classification rule corresponding to each second classification label and the second word list corresponding to each second classification rule are obtained by analyzing a target text in advance, the target text comprises a training text and/or a test text, when each second word list corresponding to one second classification rule comprises a keyword of the target text, the target text conforms to the second classification rule, and the classification label of the target text does not comprise the second classification label corresponding to the second classification rule;

the classification label judging module is used for determining whether the classification label of the text to be classified comprises a second classification label;

the classification rule judging module is used for determining whether the text to be classified conforms to a second classification rule corresponding to a second classification label when the classification label of the text to be classified comprises the second classification label;

the second classification rule determining module is used for determining that the text to be classified conforms to the second classification rule when each second word list corresponding to the second classification rule comprises a target keyword;

and the classification label removing module is used for deleting the second classification label included by the classification label of the text to be classified.

Optionally, the apparatus further includes:

the text input module is used for inputting the texts to be classified into a text classification model obtained by pre-training before the keywords in the texts to be classified are extracted by the keyword extraction module;

when the target conditions are met, triggering the keyword extraction module to execute the step of extracting the keywords in the text to be classified; the target condition is any one of the following conditions: the text classification model does not output classification labels, the accuracy of the classification labels output from the text classification model is lower than a preset accuracy, and the recall rate of the classification labels output from the text classification model is lower than a preset recall rate.

In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the text classification method according to the first aspect when executing the program.

In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the text classification method according to the first aspect.

According to the technical scheme provided by the embodiment of the application, a plurality of word lists of the training samples are obtained by analyzing the language characteristics of the training samples, and the classification rules corresponding to each classification label and the word lists corresponding to each classification rule are obtained by analyzing. By extracting the keywords of the text to be classified, when each word list corresponding to a certain first classification rule comprises one or more keywords of the text to be classified, determining that the text to be classified conforms to the first classification rule, and determining the first classification label corresponding to the first classification rule as the classification label of the text to be classified. Therefore, the technical scheme provided by the embodiment of the application can achieve the purpose of automatic classification, and the accuracy of the classification label is high. When the number of training samples is small, the text classification accuracy is high through the scheme of the embodiment of the application, the classification efficiency is improved, and the cost for collecting a large number of training samples is reduced. In addition, by maintaining different word lists, the classification effect can be adjusted, and generalization of different label systems can be realized.

Drawings

FIG. 1 is a diagram illustrating the relationship between classification labels, classification rules, and vocabularies in an embodiment of the present application;

FIG. 2 is a flowchart illustrating steps of a method for classifying texts according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating steps of another text classification method according to an embodiment of the present application;

FIG. 4 is a flowchart of the steps of one embodiment of S240 in FIG. 2;

FIG. 5 is a flowchart illustrating steps of another method for classifying text according to an embodiment of the present application;

FIG. 6 is a flowchart illustrating steps of an embodiment provided by the present application;

FIG. 7 is a block diagram of a structure of a text classification apparatus according to an embodiment of the present application;

FIG. 8 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

In the related technology, when classifying texts, a large amount of training texts need to be collected, labels of the training texts are calibrated, the texts with the labels calibrated are input into a text classification model to be trained, the text classification model automatically extracts text features of the training texts through a text classification algorithm, and classification labels corresponding to the training texts are output based on the text features; and then calculating a loss function value of the text classification model to be trained based on the classification label corresponding to the output training text and the label of the calibrated training text, obtaining the trained text classification model when the loss function value is smaller than a preset value, and realizing text classification through the trained text classification model.

However, in practical application, for some classification labels, the number of training texts that can be collected is small, so that the accuracy of the classification labels output by the trained text classification model is low.

For example, for a classification label such as "overall satisfaction", only a small number of training texts can be collected, and if the text classification model is trained with this small number of training texts, the accuracy of the classification labels output by the trained text classification model is low.

In the process of implementing the technical solution of the present application, the inventors found, by analyzing the collected training texts, that their expressions are relatively similar and that the vocabulary they use is relatively concentrated. In this case, the classification label of a text may be determined by the technical solution of the embodiments of the present application, which is described in detail in the following embodiments.

In order to solve the above problems in the related art, embodiments of the present application provide a text classification method, apparatus, electronic device, and storage medium.

For clarity of description, before describing the technical solutions of the present application, several concepts related to the technical solutions of the present application will be described first.

First, language features are introduced. After the texts under a certain label category are analyzed, the following are determined: the syntactic structures that carry the semantic information matching the label category, the specific words that express that semantic information, and how much semantic information is needed to assign the label. These syntactic structures, specific words, and semantic information may all be referred to as language features, which need to be summarized from a portion of the annotated training texts through manual analysis.

Second, word lists are introduced. A word list is the concrete expression of a language feature: all words in one word list are candidate words that express a certain language feature. If a text includes a candidate word, the text has the language feature corresponding to that candidate word.

Third, the relationships among the classification labels, the classification rules, and the word lists are introduced. Specifically, on the basis of the classification label system, the language features required by each rule are described as word lists through text analysis, word lists with the same type of language structure are merged, and the set of classification rules and the set of word lists required for each label are set manually. All classification rules together form a rule device, which generally comprises a plurality of classification rules.

Each classification label corresponds to one or more classification rules, depending on the language features of the texts under the classification label category. Each classification rule corresponds to one or more word lists, depending on how much semantic information is needed to describe the language features of the text clearly. For example, if label 1 corresponds to four word lists, the four word lists correspond to one classification rule, and the classification rule can be denoted Rule11.

The relationship between the classification labels, the classification rules, and the vocabularies will be illustrated in connection with fig. 1.

As shown in FIG. 1, assume that there are n labels under the label system: label 1, label 2, label 3, ..., label n.

Label 1 corresponds to one classification rule, Rule11. Rule11 corresponds to four word lists: word list 111 (dict111), word list 112 (dict112), word list 113 (dict113), and word list 114 (dict114). If keywords of a text appear in word list 111, word list 112, word list 113, and word list 114 at the same time, the text conforms to Rule11, and the classification labels of the text include label 1.

Label 2 corresponds to two classification rules, Rule21 and Rule22. Rule21 corresponds to one word list, word list 211 (dict211), and Rule22 corresponds to two word lists, word list 221 (dict221) and word list 222 (dict222). If a keyword of a text appears in word list 211, the text conforms to Rule21, and the classification labels of the text include label 2; alternatively, if keywords of a text appear in both word list 221 and word list 222, the text conforms to Rule22, and the classification labels of the text include label 2.

Label 3 corresponds to one classification rule, Rule31. Rule31 corresponds to three word lists: word list 311 (dict311), word list 312 (dict312), and word list 313 (dict313). If keywords of a text appear in word list 311, word list 312, and word list 313 at the same time, the text conforms to Rule31, and the classification labels of the text include label 3.

Label n corresponds to three classification rules, Rulen1, Rulen2, and Rulen3. Rulen1 corresponds to one word list, word list n11 (dictn11); Rulen2 corresponds to two word lists, word list n21 (dictn21) and word list n22 (dictn22); Rulen3 corresponds to three word lists, word list n31 (dictn31), word list n32 (dictn32), and word list n33 (dictn33). If a keyword of a text appears in word list n11, the text conforms to Rulen1, and the classification labels of the text include label n; alternatively, if keywords of a text appear in both word list n21 and word list n22, the text conforms to Rulen2, and the classification labels of the text include label n; alternatively, if keywords of a text appear in word list n31, word list n32, and word list n33 at the same time, the text conforms to Rulen3, and the classification labels of the text include label n.
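The label, rule, and word list hierarchy of FIG. 1 can be represented, for illustration, as a nested mapping. The sketch below is an assumption-laden example: the words inside each word list are invented, and the names `RULES`, `rule_matches`, and `labels_for` are hypothetical. Only the shape (label to rules, rule to word lists) and the matching condition follow the description above.

```python
from typing import Dict, List, Set

# Hypothetical contents; only the label -> rule -> word-list shape follows FIG. 1.
RULES: Dict[str, Dict[str, List[Set[str]]]] = {
    "label1": {
        "Rule11": [{"delivery"}, {"slow", "late"}, {"very"}, {"unhappy"}],
    },
    "label2": {
        "Rule21": [{"refund"}],
        "Rule22": [{"price"}, {"expensive", "high"}],
    },
}


def rule_matches(word_lists: List[Set[str]], keywords: Set[str]) -> bool:
    """A rule matches when every one of its word lists contains at least one keyword."""
    return all(word_list & keywords for word_list in word_lists)


def labels_for(keywords: Set[str]) -> Set[str]:
    """Return every label that has at least one matching rule."""
    return {
        label
        for label, rules in RULES.items()
        if any(rule_matches(word_lists, keywords) for word_lists in rules.values())
    }
```

With these invented word lists, a text with keywords "price" and "high" satisfies Rule22 and therefore receives label 2, while a text with only "price" satisfies no rule.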

After introducing several concepts related to the technical solutions of the present application, the technical solutions of the embodiments of the present application will be described in detail below.

In a first aspect, a text classification method provided in an embodiment of the present application is first explained in detail.

As shown in fig. 2, a text classification method provided in an embodiment of the present application may include the following steps:

S210, obtaining the text to be classified.

The text to be classified may be a test text or a text carried in a user request; either is acceptable, and this is not specifically limited in the embodiments of the present application.

Before executing S220 to extract the keywords in the text to be classified, in an embodiment, the following steps may also be executed, which are step a and step b, respectively:

Step a, inputting the text to be classified into a text classification model obtained by pre-training.

Specifically, after the text to be classified is obtained, the text to be classified may be input into a text classification model obtained through pre-training, the text classification model obtained through pre-training is a traditional text classification model, and the text classification model is used to identify the classification label of the text to be classified. If the text classification model can identify the classification label of the text to be classified, the text classification model outputs the classification label of the text to be classified; if the text classification model does not recognize the classification label of the text to be classified, the text classification model does not output the classification label of the text to be classified.

Step b, when a target condition is met, executing the step of extracting the keywords in the text to be classified.

The target condition is any one of the following: the text classification model outputs no classification label; the accuracy of the classification labels output by the text classification model is lower than a preset accuracy; or the recall rate of the classification labels output by the text classification model is lower than a preset recall rate.

Specifically, after the text to be classified is input into the text classification model obtained by pre-training, the following three situations can occur:

The first case is: the text classification model does not recognize the classification label of the text to be classified and therefore outputs no classification label.

The second case is: the text classification model outputs classification labels for the text to be classified, but the text to be classified has a plurality of classification labels and the text classification model outputs only one or some of them rather than all possible classification labels. In this case, the recall rate of the output of the text classification model is low, that is, lower than the preset recall rate.

The third case is: the text classification model outputs classification labels for the text to be classified, but the output contains wrong classification labels. In this case, the accuracy of the output of the text classification model is low, that is, lower than the preset accuracy.

When any one of the three situations occurs, the technical scheme of the embodiment of the application needs to further determine the classification label of the text to be classified, so as to improve the recall rate and accuracy rate of determining the classification label.

It can be seen that, according to the technical solution provided by this embodiment, when the text classification model outputs no classification label, or the recall rate or accuracy of the classification labels it outputs is low, S220 to S250 are executed so that the classification labels output by the text classification model can be optimized and supplemented, which helps improve the recall rate and accuracy of the text classification labels.
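The target condition of step b can be sketched as a simple gate in front of S220. This is an illustration only: `ModelResult` and `needs_rule_based_pass` are hypothetical names, and the accuracy and recall values are assumed to be estimates available for the model's output (for example from held-out data), since the text does not say how they are measured.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class ModelResult:
    labels: Set[str]  # classification labels output by the pre-trained model
    accuracy: float   # assumed: an available estimate of output accuracy
    recall: float     # assumed: an available estimate of output recall


def needs_rule_based_pass(result: ModelResult, preset_accuracy: float,
                          preset_recall: float) -> bool:
    """Return True when any one of the three target conditions holds."""
    return (
        not result.labels                      # case 1: no label output
        or result.recall < preset_recall       # case 2: recall below the preset value
        or result.accuracy < preset_accuracy   # case 3: accuracy below the preset value
    )
```

When the gate returns True, the keyword extraction and rule matching of S220 to S250 would be run to supplement or correct the model's labels.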

S220, extracting a plurality of keywords of the text to be classified.

After the text to be classified is obtained, a plurality of keywords in the text to be classified can be extracted. The plurality of keywords may be nouns, adjectives or adverbs included in the text to be classified. Specifically, the words with positive or negative meanings may be a plurality of keywords in the text to be classified. For example, the word "very dislike" may be a keyword in the text to be classified.

S230, acquiring first classification rules corresponding to the plurality of first classification labels respectively and a first word list corresponding to each first classification rule.

The first classification rule corresponding to each first classification label and the first word list corresponding to each first classification rule are obtained by analyzing a training text in advance, when each first word list corresponding to one first classification rule comprises a keyword in the training text, the training text conforms to the first classification rule, and the classification label of the training text comprises the first classification label corresponding to the first classification rule.

Specifically, as shown in FIG. 1, the training text is analyzed in advance to obtain the classification rules corresponding to each classification label and the word lists corresponding to each classification rule. Each classification label corresponds to one or more classification rules, depending on the language features of the texts under the classification label category; each classification rule corresponds to one or more word lists, depending on how much semantic information is needed to describe the language features of the text clearly. When each word list corresponding to one classification rule includes a keyword of a certain training text, the training text is determined to conform to the classification rule, and the classification labels of the training text include the classification label corresponding to the classification rule. These rules may be referred to as new-label rules, and the rule device formed by them may be referred to as a new-label rule device, which includes a plurality of new-label rules.

For clarity of the description of the scheme, the classification tag is referred to as a first classification tag, the classification rule corresponding to the classification tag is referred to as a first classification rule, and the vocabulary corresponding to the classification rule is referred to as a first vocabulary.

S240, when each first word list corresponding to the first target classification rule comprises a target keyword, determining that the text to be classified conforms to the first target classification rule.

The target keywords are one or more keywords in a plurality of keywords; the first target classification rule is any first classification rule corresponding to any classification label.

Specifically, after extracting a plurality of keywords of the text to be classified, for each first vocabulary corresponding to any first target classification rule, it may be determined whether one or more keywords exist in the plurality of keywords in the first vocabulary, and if each first vocabulary includes one or more keywords in the plurality of keywords, it may be determined that the text to be classified conforms to the first target classification rule.

And S250, determining the first target classification label corresponding to the first target classification rule as the classification label of the text to be classified.

Specifically, when it is determined through S240 that the text to be classified conforms to the first target classification rule, it indicates that the classification label of the text to be classified includes the first target classification label corresponding to the first target classification rule, and therefore, the first target classification label corresponding to the first target classification rule is determined as the classification label of the text to be classified.

According to the technical scheme provided by the embodiment of the application, a plurality of word lists of the training samples are obtained by analyzing the language characteristics of the training samples, and the classification rules corresponding to each classification label and the word lists corresponding to each classification rule are obtained by analyzing. By extracting the keywords of the text to be classified, when each word list corresponding to a certain first classification rule comprises one or more keywords of the text to be classified, determining that the text to be classified conforms to the first classification rule, and determining the first classification label corresponding to the first classification rule as the classification label of the text to be classified. Therefore, the automatic classification can be achieved through the technical scheme provided by the embodiment of the application, and the accuracy of the classification label is high. When the number of training samples is small, the text classification accuracy is high through the scheme of the embodiment of the application, the classification efficiency is improved, and the cost for collecting a large number of training samples is reduced. In addition, by maintaining different word lists, the classification effect can be adjusted, and generalization of different label systems can be realized.

On the basis of the embodiment shown in fig. 2, in an implementation manner, as shown in fig. 3, the text classification method further includes:

S240a, for any first classification rule in the plurality of first classification rules, when the first classification rule corresponds to a plurality of first word lists, the plurality of first word lists are sorted in order of the priorities of the words they include, from high to low, so as to obtain a plurality of sorted first word lists.

Wherein, the priority of the words is obtained by analyzing the training text in advance.

Specifically, in practical applications, each text may include a plurality of keywords, some words in the plurality of keywords are more important, for example, a noun, an adverb with a positive or negative meaning, and the like are more important words, and the more important words may be determined as words with higher priority.

When the first classification rule corresponds to a plurality of first vocabularies, the priority of each first vocabulary may be determined according to the importance of the words it includes. If the words included in a first vocabulary are of higher importance, the priority of that first vocabulary can be determined to be higher; if the words included in a first vocabulary are of lower importance, the priority of that first vocabulary can be determined to be lower.

After the priorities are determined, the first word lists may be sorted in order of priority from high to low to obtain the sorted first word lists. The sorted first vocabularies may be numbered; for example, the first vocabulary with the lowest priority is numbered 1, the numbers increase sequentially from the lowest priority to the highest priority, and the first vocabulary with the highest priority has the largest number, that is, the plurality of vocabularies are numbered 1, 2, 3, and so on.

And S240b, sequentially judging whether each first word list comprises the target keyword according to the sequence of the sequenced first word lists. When each first vocabulary includes the target keyword, S240 is performed to determine that the text to be classified conforms to the first target classification rule.

Specifically, after the plurality of ordered first vocabularies are obtained, whether one or more keywords in the text to be classified appear in the vocabularies or not may be sequentially determined for each first vocabularies according to the sequence of the plurality of ordered first vocabularies.

When each of the plurality of first word lists corresponding to a certain first classification rule includes one or more keywords of the text to be classified, it is determined that the text to be classified conforms to the first classification rule. If any of the plurality of first word lists corresponding to a certain first classification rule does not include a keyword of the text to be classified, it is determined that the text to be classified does not conform to the first classification rule.

It can be seen that, according to the technical solution provided by this embodiment, when the first classification rule corresponds to the plurality of first vocabularies, the plurality of first vocabularies may be sorted according to the priorities of the plurality of first vocabularies to obtain a plurality of sorted first vocabularies, and when the plurality of first vocabularies each include one or more keywords in the text to be classified, it is determined that the text to be classified conforms to the first classification rule, so as to accurately determine whether the text to be classified belongs to each first classification rule, and further, it is possible to accurately determine the classification label of the text to be classified.
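The sorting and sequential check of S240a and S240b might look like the sketch below. The application does not say how the priority of a word list is derived from the priorities of its words, so taking the maximum word priority is an assumption, as are the function names and the integer priority table.

```python
from typing import Dict, List, Set


def sort_word_lists(word_lists: List[Set[str]],
                    word_priority: Dict[str, int]) -> List[Set[str]]:
    """Sort word lists from high to low priority.

    Assumption: a word list's priority is the highest priority of any word in it;
    words missing from the priority table count as 0.
    """
    def list_priority(word_list: Set[str]) -> int:
        return max((word_priority.get(w, 0) for w in word_list), default=0)

    return sorted(word_lists, key=list_priority, reverse=True)


def conforms(keywords: Set[str], sorted_lists: List[Set[str]]) -> bool:
    """Check the word lists in sorted order and stop at the first one with no keyword."""
    for word_list in sorted_lists:
        if not (keywords & word_list):
            return False
    return True


# Illustrative data only (assumed).
word_priority = {"refund": 3, "order": 2, "please": 1}
ordered = sort_word_lists([{"please"}, {"refund", "return"}, {"order"}], word_priority)
print(conforms({"refund", "order", "please"}, ordered))
```

Checking the highest-priority word list first lets a non-matching text be rejected early, although the final result is the same in any order.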

Based on the embodiment shown in fig. 2, in an implementation manner, as shown in fig. 4, S240, determining that the text to be classified conforms to the first target classification rule includes:

S241, determining the position of each target keyword in the text to be classified.

Specifically, the target keywords are the keywords of the text to be classified that exist in the first vocabularies, and the position of each target keyword in the text to be classified may be labeled. For example, the character position at which each target keyword occurs in the text to be classified is recorded.

S242, calculating the distance between every two adjacent target keywords according to the positions of every two adjacent target keywords in the text to be classified.

The distance between every two adjacent target keywords is used for representing the semantic association degree between the two corresponding adjacent target keywords, and the distance is in inverse proportion to the semantic association degree.

Specifically, after the position of each target keyword in the text to be classified is determined, the distance between any two adjacent target keywords can be calculated. For example, the first target keyword and the second target keyword are separated by 5 characters, and then the distance between the two may be 5 characters. The third target keyword is 10 characters away from the fourth target keyword, and the distance between the third target keyword and the fourth target keyword may be 10 characters.

It can be understood that the smaller the distance between two adjacent target keywords is, the higher the semantic association degree between the two is; correspondingly, the larger the distance between two adjacent target keywords is, the lower the semantic association degree between the two target keywords is. For example, if a user inputs a very long text, the distance between two adjacent target keywords may be relatively large.

And S243, when the distance between every two adjacent target keywords is smaller than the preset distance, determining that the text to be classified accords with a first target classification rule.

The preset distance may be set according to actual conditions; for example, the preset distance may be set to 8 characters.

Specifically, when the distance between every two adjacent target keywords is smaller than the preset distance, it is indicated that the semantic association degree between every two adjacent target keywords is high, and therefore it is determined that the text to be classified conforms to the first target classification rule.

It can be seen that, according to the technical scheme provided by the embodiment, when each first vocabulary corresponding to the first classification rule includes the target keyword, the distance between the adjacent target keywords is further determined according to the positions of the target keywords in the text to be classified, and when the distance between any two adjacent target keywords is small, that is, the semantic association degree of any two adjacent target keywords is high, it is determined that the text to be classified conforms to the first classification rule, so that whether the text to be classified belongs to each first classification rule is determined more accurately, and then the classification label of the text to be classified can be determined more accurately.
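A minimal sketch of S241 to S243 follows. It assumes that the position of a keyword is the character span of its first occurrence and that distance is the number of characters separating two adjacent keywords; both choices, along with the function names, are illustrative assumptions. The default of 8 characters is the example value mentioned above.

```python
from typing import List, Tuple


def keyword_positions(text: str, target_keywords: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of each keyword's first occurrence, in text order."""
    spans = []
    for kw in target_keywords:
        start = text.find(kw)  # assumption: only the first occurrence is considered
        if start != -1:
            spans.append((start, start + len(kw)))
    return sorted(spans)


def adjacent_gaps(spans: List[Tuple[int, int]]) -> List[int]:
    """Number of characters separating each pair of adjacent keywords."""
    return [max(0, nxt[0] - prev[1]) for prev, nxt in zip(spans, spans[1:])]


def within_preset_distance(text: str, target_keywords: List[str],
                           preset_distance: int = 8) -> bool:
    """True when every pair of adjacent target keywords is closer than the preset distance."""
    gaps = adjacent_gaps(keyword_positions(text, target_keywords))
    return all(gap < preset_distance for gap in gaps)


print(within_preset_distance("refund my order", ["refund", "order"]))
```

With a single target keyword there are no adjacent pairs, so this sketch treats the condition as satisfied.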

As can be seen from the description of the above embodiment, in the above embodiment, a new classification tag is added to the text to be classified, and in order to further improve the accuracy of the classification tag of the text to be classified, it may also be determined whether the classification tag output by the classification model includes an error tag or whether the new classification tag is accurate, so that the embodiment of the present application further determines whether the classification tag of the text to be classified belongs to the classification tag that should be removed.

Therefore, on the basis of the embodiment shown in any one of fig. 2 to fig. 4, in an implementation manner, the text classification method, as shown in fig. 5, may further include the following steps:

S260, acquiring second classification rules corresponding to the plurality of second classification labels respectively and a second word list corresponding to each second classification rule.

The second classification rule corresponding to each second classification label and the second vocabulary corresponding to each second classification rule are obtained by analyzing a target text in advance, the target text comprises a training text and/or a test text, when each second vocabulary corresponding to one second classification rule comprises a keyword of the target text, the target text conforms to the second classification rule, and the classification label of the target text does not comprise the second classification label corresponding to the second classification rule.

Specifically, the training text and/or the test text are analyzed in advance to obtain second classification labels, namely error labels, second classification rules corresponding to each second classification label, and a second vocabulary corresponding to each second classification rule. Each second classification label corresponds to one or more second classification rules; and the number of word lists corresponding to each second classification rule is one or more, and when the training text and/or the keywords of the test text appear in each second word list corresponding to a certain second classification rule, the classification label of the text is not the classification label corresponding to the second classification rule.

S270, whether the classification labels of the texts to be classified comprise second classification labels or not is determined.

After a plurality of second classification labels are obtained, whether the existing classification labels of the text to be classified comprise the second classification labels or not can be determined, and if the existing classification labels of the text to be classified do not comprise the second classification labels, processing is not carried out; if the existing classification tags of the text to be classified include the second classification tag, S280 is performed.

S280, when the classification label of the text to be classified comprises a second classification label, determining whether the text to be classified conforms to a second classification rule corresponding to the second classification label.

And S290, when each second word list corresponding to the second classification rule comprises the target keyword, determining that the text to be classified conforms to the second classification rule.

When the classification label of the text to be classified comprises a second classification label, one or more second classification rules corresponding to the second classification label are obtained, and for any second classification rule, when the second classification rule corresponds to a plurality of second word lists, the second word lists are sequenced according to the sequence from high to low of the priority of words included in the second word lists, so that the sequenced second word lists are obtained. Wherein, the priority of the words is obtained by analyzing the training texts and/or the test texts in advance.

The priority level of a word may be determined as follows:

each text may include a plurality of keywords, some of which are more important, for example, nouns or adverbs with positive or negative meanings, etc. are more important words, and the more important words may be determined as words with higher priority.

When the second classification rule corresponds to a plurality of second vocabularies, the priority of each second vocabulary may be determined according to the importance of the words it includes. If the words included in a second vocabulary are of higher importance, the priority of that second vocabulary can be determined to be higher; if the words included in a second vocabulary are of lower importance, the priority of that second vocabulary can be determined to be lower.

After the priorities are determined, the second word lists may be sorted in order of priority from high to low to obtain the sorted second word lists. The sorted second vocabularies may be numbered; for example, the second vocabulary with the lowest priority is numbered 1, the numbers increase sequentially from the lowest priority to the highest priority, and the second vocabulary with the highest priority has the largest number, that is, the plurality of second vocabularies are numbered 1, 2, 3, and so on.

After the plurality of ordered second word lists are obtained, whether the target keywords in the text to be classified appear in the second word list or not can be sequentially judged for each second word list according to the sequence of the plurality of ordered second word lists; wherein a target keyword is one or more keywords.

And when a plurality of second word lists corresponding to a certain second classification rule all comprise target keywords of the text to be classified, determining that the text to be classified conforms to the second classification rule. And if the second word list which does not comprise the keywords of the text to be classified exists in the plurality of second word lists corresponding to a certain second classification rule, determining that the text to be classified does not accord with the second classification rule.

In addition, when a plurality of second word lists corresponding to a certain second classification rule all comprise target keywords of the text to be classified, the positions of the target keywords in the text to be classified can be determined; calculating the distance between every two adjacent target keywords based on the positions of every two adjacent target keywords in the text to be classified; when the distance between every two adjacent target keywords is smaller than the preset distance, determining that the text to be classified conforms to a second classification rule; the preset distance may be determined according to an actual situation, which is not specifically limited in the embodiment of the present application.

And S2100, deleting the second classification label included in the classification label of the text to be classified.

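Specifically, when the text to be classified is judged to conform to a certain second classification rule, this indicates that the classification label of the text to be classified should not include the second classification label corresponding to that second classification rule; therefore, the second classification label is deleted from the classification labels of the text to be classified.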

Therefore, according to the technical scheme provided by the embodiment of the application, after the classification label of the text to be classified is determined, it can be further judged whether the classification label of the text to be classified includes a second classification label, that is, a classification label to be removed. When it does, whether the text to be classified conforms to one or more second classification rules corresponding to that label is analyzed; if so, the label is removed from the classification labels of the text to be classified, which further improves the accuracy of the classification label of the text to be classified.
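The removal steps S270 to S2100 can be sketched as follows, under the assumption that the second classification rules are stored per label as lists of word lists. The function name `remove_error_labels` and the sample data are hypothetical, and the optional ordering and distance checks are omitted for brevity.

```python
from typing import Dict, List, Set

Rule = List[Set[str]]  # hypothetical: a rule is a list of word lists


def remove_error_labels(labels: List[str], keywords: Set[str],
                        removal_rules_by_label: Dict[str, List[Rule]]) -> List[str]:
    """Drop each label whose second classification rules are met by the keywords."""
    kept = []
    for label in labels:
        rules = removal_rules_by_label.get(label, [])  # no rules: not a second label
        rule_met = any(all(keywords & word_list for word_list in rule) for rule in rules)
        if not rule_met:
            kept.append(label)
    return kept


# Illustrative data only (assumed).
removal_rules = {"complaint": [[{"not"}, {"bad"}]]}
print(remove_error_labels(["complaint", "praise"], {"not", "bad", "service"}, removal_rules))
```

Labels that are not second classification labels pass through unchanged, which corresponds to the "processing is not carried out" branch of S270.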

For clarity of description, the technical solutions of the embodiments of the present application will be described in detail below with reference to specific examples.

As shown in fig. 6, a text and the label "none" are input; that is, the classification label of the text cannot be recognized by the conventional text classification model.

The text and the label "none" are input into the new tag ruler, and it is judged whether the text meets a certain classification rule in the ruler. The new tags refer to the plurality of first classification tags in the above embodiment, and the new tag ruler is composed of the first classification rules corresponding to the first classification tags.

If the text meets a certain classification rule in the rulers, adding labelx as the label of the text at the label position, otherwise, keeping the label 'none'.

Then, it is judged whether the newly added label labelx is in the tag remover; the labels in the tag remover are labels to be removed, that is, the second classification labels described in the above embodiment.

When labelx is in the tag remover, it is judged whether the text meets the tag removal ruler; otherwise, the text and the label result are output. The tag removal ruler is composed of the second classification rules corresponding to the second classification labels.

If the text meets the tag removal ruler, labelx is removed from the label position, and the text and the label result after labelx is removed are output; if the text does not meet the tag removal ruler, the text and the label result are output.
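The flow of fig. 6 can be pictured end to end with the sketch below. It makes several assumptions that are not stated in the application: keywords are obtained by simple lowercase whitespace splitting, only the first matching new label is added, and the label position reverts to "none" after a label is removed. All names (`run_pipeline`, `new_tag_ruler`, `tag_removal_ruler`) are hypothetical.

```python
from typing import Dict, List, Set

Rule = List[Set[str]]  # hypothetical: a rule is a list of word lists


def meets_any_rule(keywords: Set[str], rules: List[Rule]) -> bool:
    """True if at least one rule has every word list hit by some keyword."""
    return any(all(keywords & word_list for word_list in rule) for rule in rules)


def run_pipeline(text: str,
                 new_tag_ruler: Dict[str, List[Rule]],
                 tag_removal_ruler: Dict[str, List[Rule]]) -> str:
    # Placeholder keyword extraction (assumption): lowercase whitespace tokens.
    keywords = set(text.lower().split())

    label = "none"  # the model could not recognize a label
    for candidate, rules in new_tag_ruler.items():
        if meets_any_rule(keywords, rules):
            label = candidate  # labelx is added at the label position
            break

    # The keys of tag_removal_ruler play the role of the tag remover.
    if label != "none" and label in tag_removal_ruler:
        if meets_any_rule(keywords, tag_removal_ruler[label]):
            label = "none"  # assumption: the label position reverts to "none"
    return label


# Illustrative data only (assumed).
new_tag_ruler = {"complaint": [[{"bad", "terrible"}, {"service"}]]}
tag_removal_ruler = {"complaint": [[{"not"}, {"bad"}]]}
print(run_pipeline("the service was terrible", new_tag_ruler, tag_removal_ruler))
print(run_pipeline("the service was not bad", new_tag_ruler, tag_removal_ruler))
```

In the second call the new tag ruler adds "complaint", and the tag removal ruler then removes it because both "not" and "bad" appear among the keywords.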

It is noted that, for simplicity of explanation, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary and that the acts involved are not necessarily required by the present application.

In a second aspect, an embodiment of the present application provides a text classification apparatus, as shown in fig. 7, including:

a text obtaining module 710, configured to obtain a text to be classified;

a keyword extraction module 720, configured to extract a plurality of keywords of the text to be classified;

the first information obtaining module 730 is configured to obtain first classification rules corresponding to the plurality of first classification tags, and a first vocabulary corresponding to each first classification rule; the first classification rule corresponding to each first classification label and the first word list corresponding to each first classification rule are obtained by analyzing a training text in advance, when each first word list corresponding to one first classification rule comprises a keyword of the training text, the training text conforms to the first classification rule, and the classification label of the training text comprises the first classification label corresponding to the first classification rule;

a first classification rule determining module 740, configured to determine that the text to be classified conforms to a first target classification rule when each first vocabulary corresponding to the first target classification rule includes a target keyword; the target keyword is one or more keywords in the plurality of keywords; the first target classification rule is a first classification rule corresponding to any classification label;

a classification label determining module 750, configured to determine a first target classification label corresponding to the first target classification rule as a classification label of the text to be classified.

According to the technical scheme provided by the embodiment of the application, a plurality of word lists of the training samples are obtained by analyzing the language features of the training samples, and the classification rules corresponding to each classification label and the word lists corresponding to each classification rule are obtained by analyzing. By extracting the keywords of the text to be classified, when each word list corresponding to a certain first classification rule comprises one or more keywords of the text to be classified, determining that the text to be classified conforms to the first classification rule, and determining the first classification label corresponding to the first classification rule as the classification label of the text to be classified. Therefore, the automatic classification can be achieved through the technical scheme provided by the embodiment of the application, and the accuracy of the classification label is high. When the number of training samples is small, the text classification accuracy is high through the scheme of the embodiment of the application, the classification efficiency is improved, and the cost for collecting a large number of training samples is reduced. In addition, by maintaining different word lists, the classification effect can be adjusted, and generalization of different label systems can be realized.

Optionally, the apparatus further includes:

the word list ordering module is used for ordering the first word lists according to the sequence of the priorities of words included in the first word lists from high to low to obtain a plurality of ordered first word lists for any first classification rule in the first classification rules; the priority of the words is obtained by analyzing the training text in advance;

determining the position of each target keyword in the text to be classified;

Optionally, the apparatus further includes:

For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.

In a third aspect, an embodiment of the present application provides an electronic device, as shown in fig. 8, including a memory 810, a processor 820, and a computer program stored on the memory and operable on the processor, where the processor implements the steps of the text classification method according to the first aspect when executing the program.

The embodiments in the present specification are all described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same and similar between the embodiments may be referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "include", "including" or any other variations thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or terminal device that comprises the element.

The text classification method, the text classification device, the electronic device and the storage medium provided by the application are introduced in detail, and a specific example is applied in the text to explain the principle and the implementation of the application, and the description of the embodiment is only used for helping to understand the method and the core idea of the application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. A method of text classification, comprising:

acquiring a text to be classified;

extracting a plurality of key words of the text to be classified;

acquiring first classification rules respectively corresponding to the plurality of first classification labels and a first word list corresponding to each first classification rule; the first classification rule corresponding to each first classification label and the first word list corresponding to each first classification rule are obtained by analyzing a training text in advance, when each first word list corresponding to one first classification rule comprises a keyword of the training text, the training text conforms to the first classification rule, and the classification label of the training text comprises the first classification label corresponding to the first classification rule;

2. The method of claim 1, further comprising:

and when each first word list comprises the target key words, executing the step of determining that the text to be classified conforms to the first target classification rule.

3. The method of claim 1, wherein the determining that the text to be classified complies with the first target classification rule comprises:

determining the position of each target keyword in the text to be classified;

4. The method of any of claims 1 to 3, further comprising:

acquiring second classification rules corresponding to the plurality of second classification labels respectively and a second word list corresponding to each second classification rule; the second classification rule corresponding to each second classification label and the second vocabulary corresponding to each second classification rule are obtained by analyzing a target text in advance, wherein the target text comprises a training text and/or a test text, when each second vocabulary corresponding to one second classification rule comprises a keyword of the target text, the target text conforms to the second classification rule, and the classification label of the target text does not comprise the second classification label corresponding to the second classification rule;

5. The method of any of claims 1 to 3, further comprising:

when the target conditions are met, executing the step of extracting the keywords in the text to be classified; the target condition is any one of the following conditions: the text classification model does not output a classification tag, the accuracy of the classification tag output from the text classification model is lower than a preset accuracy, and the recall rate of the classification tag output from the text classification model is lower than a preset recall rate.

6. A text classification apparatus, comprising:

the text acquisition module is used for acquiring texts to be classified;

the first classification rule determining module is used for determining that the text to be classified conforms to the first target classification rule when each first word list corresponding to the first target classification rule comprises a target keyword; the target keyword is one or more keywords in the plurality of keywords; the first target classification rule is a first classification rule corresponding to any classification label;

7. The apparatus of claim 6, further comprising:

8. The apparatus of claim 6, wherein the category label determination module is specifically configured to:

determining the position of each target keyword in the text to be classified;

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the text classification method according to any one of claims 1 to 5 are implemented by the processor when executing the program.

10. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when being executed by a processor, carries out the steps of a text classification method according to any one of claims 1 to 5.