CN113987136A - Method, device and equipment for correcting text classification label and storage medium - Google Patents

Publication number
CN113987136A
Authority
CN
China
Prior art keywords
text
classification
corrected
category
support rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111435681.0A
Other languages
Chinese (zh)
Inventor
孙小婉
蔡巍
张霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Neusoft Intelligent Medical Technology Research Institute Co Ltd
Original Assignee
Shenyang Neusoft Intelligent Medical Technology Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Neusoft Intelligent Medical Technology Research Institute Co Ltd
Priority to CN202111435681.0A
Publication of CN113987136A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method, a device, equipment and a storage medium for correcting a text classification label. The method comprises the following steps: calculating a first classification support rate of a text to be corrected under each set classification category based on the classification results of the text to be corrected in each trained text classifier; predicting a second classification support rate of the text to be corrected under each classification category based on the association frequency, in the text to be corrected, of the associated vocabulary set constructed under each classification category; and correcting the labeled classification label of the text to be corrected by using the first classification support rate and the second classification support rate of the text to be corrected under each classification category. The true classification of the text to be corrected is thus analyzed from two angles, the text classifiers and the associated vocabulary sets, so that the text classification label is corrected and its accuracy is ensured. A text classification model subsequently trained on the corrected texts can then achieve higher classification accuracy.

Description

Method, device and equipment for correcting text classification label and storage medium
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a method, a device, equipment and a storage medium for correcting a text classification label.
Background
With the rapid development of the deep learning technology, various texts can be classified by adopting a pre-trained text classification model, so that the text classification effect is remarkably improved. At this time, when training the corresponding text classification model, a large amount of text data with classification labels is required as a training sample of the text classification model.
Currently, because it is expensive and time-consuming for experts to label the classification label of every text, the classification labels of the texts used as training samples are usually collected from crowdsourcing, where they are labeled by volunteers from the general public. However, text classification labels collected from crowdsourcing are affected by various kinds of labeling noise. For example, compared with experts, such volunteers have a weaker understanding of each text and are prone to labeling errors, so the text classification labels of the training samples are inaccurate, and the classification accuracy of the trained text classification model on various texts is consequently low.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for correcting text classification labels, which are used for respectively analyzing the real classification condition of each text from two angles so as to correct the classification labels of each text, ensure the accuracy of the text classification labels and further improve the classification accuracy of a text classification model trained by using the text.
In a first aspect, an embodiment of the present application provides a method for correcting a text classification tag, where the method includes:
calculating a first classification support rate of the text to be corrected under each set classification category based on the classification result of the text to be corrected in each trained text classifier;
predicting a second classification support rate of the text to be corrected under each classification category based on the association frequency of the constructed associated words in the text to be corrected under each classification category;
and correcting the labeled classification label of the text to be corrected by utilizing the first classification support rate and the second classification support rate of the text to be corrected under each classification category.
Further, the calculating a first classification support rate of the text to be corrected under each set classification category based on the classification result of the text to be corrected in each trained text classifier includes:
respectively inputting the text to be corrected into each trained text classifier to obtain a classification result matrix of the text to be corrected;
and calculating the first classification support rate of the text to be corrected under each classification category based on the occurrence frequency of the category name of each classification category in the classification result matrix.
Further, before calculating a first classification support rate of the text to be corrected under each set classification category based on the classification result of the text to be corrected in each trained text classifier, the method further includes:
extracting corresponding test text sets in batches from the constructed text training library;
respectively training corresponding text classifiers by using the test text sets extracted in each batch to obtain the trained text classifiers;
wherein the test text sets extracted from different batches are different.
Further, the predicting a second classification support rate of the text to be corrected in each classification category based on the association frequency of the constructed associated words in the text to be corrected in each classification category includes:
counting the occurrence times of the associated vocabulary in the associated vocabulary set under each classification category in the text to be corrected;
and calculating a second classification support rate of the text to be corrected under each classification category according to the occurrence times under each classification category and the total number of words in the text to be corrected.
Further, before predicting a second classification support rate of the text to be corrected in each classification category based on the association frequency of the constructed associated words in the text to be corrected in each classification category, the method further includes:
analyzing the replaceable vocabulary of the category name of each classification category to construct the associated vocabulary set under that classification category.
Further, the analyzing the replaceable vocabulary of the category name of each classification category includes:
aiming at each classification category, searching out a target text under the classification category from a constructed text training library, wherein the target text contains a category name of the classification category;
hiding the category name of the classification category in each target text, inputting each hidden target text into a constructed semantic replacement model, and outputting a replaceable vocabulary of the category name of the classification category.
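The two sub-steps above (hide the category name in each target text, then ask a semantic replacement model what could fill the gap) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `collect_replaceable_vocab`, the `"[MASK]"` placeholder, and the stand-in `toy_model` are all hypothetical and do not come from the application, which does not specify the model at this point in the text.

```python
def collect_replaceable_vocab(target_texts, category_name, replacement_model,
                              mask_token="[MASK]"):
    """Gather words that a replacement model proposes for the hidden category name.

    replacement_model is any callable mapping a hidden text to a list of
    candidate words (hypothetical interface, assumed for this sketch).
    """
    vocab = set()
    for text in target_texts:
        # Only texts that actually contain the category name are target texts.
        if category_name not in text:
            continue
        hidden = text.replace(category_name, mask_token)
        vocab.update(replacement_model(hidden))
    return vocab


def toy_model(hidden_text):
    # Stand-in for a real semantic replacement model (assumption for the demo).
    canned = {
        "I watch [MASK] every weekend": ["football", "basketball"],
        "[MASK] keep people healthy": ["running", "football"],
    }
    return canned.get(hidden_text, [])


texts = ["I watch sports every weekend",
         "sports keep people healthy",
         "the market fell today"]
vocab = collect_replaceable_vocab(texts, "sports", toy_model)
```

In this toy run the third text is skipped because it does not contain the category name, and the proposals from the other two texts are merged into one set.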
Further, the correcting the labeled classification label of the text to be corrected by using the first classification support rate and the second classification support rate of the text to be corrected in each classification category includes:
respectively determining a first belonged category, which corresponds to the maximum value among the first classification support rates of the text to be corrected, and a second belonged category, which corresponds to the maximum value among the second classification support rates of the text to be corrected;
and if the first belonged category is the same as the second belonged category, correcting the labeled classification label of the text to be corrected into the category name of the same belonged category.
Furthermore, the number of the trained text classifiers is greater than or equal to the number of the set classification classes, and the text to be corrected is any text in the constructed text training library.
In a second aspect, an embodiment of the present application provides an apparatus for correcting a text classification tag, where the apparatus includes:
the text classification calculation module is used for calculating a first classification support rate of the text to be corrected under each set classification category based on the classification result of the text to be corrected in each trained text classifier;
the text classification prediction module is used for predicting a second classification support rate of the text to be corrected under each classification category based on the association frequency, in the text to be corrected, of the associated vocabulary set constructed under each classification category;
and the classification label correction module is used for correcting the labeled classification labels of the text to be corrected by utilizing the first classification support rate and the second classification support rate of the text to be corrected in each classification category.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a processor and a memory, the memory being configured to store a computer program, the processor being configured to call up and run the computer program stored in the memory to perform the method of correcting a text classification tag provided in the first aspect of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium for storing a computer program, where the computer program makes a computer execute the method for correcting a text classification label as provided in the first aspect of the present application.
In a fifth aspect, the present application provides a computer program product, comprising computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the method for correcting a text classification tag as provided in the first aspect of the present application.
According to the method, the device, the equipment and the storage medium for correcting text classification labels provided by the embodiments of the application, a plurality of text classifiers are trained in advance, and a corresponding associated vocabulary set is constructed in advance for each set classification category. When the classification label of a text to be corrected is corrected, a first classification support rate of the text to be corrected under each classification category is calculated according to the classification results of the text to be corrected in each text classifier, and a second classification support rate of the text to be corrected under each classification category is predicted according to the association frequency, in the text to be corrected, of the associated vocabulary set under each classification category. The true classification of the text to be corrected is thus analyzed from two angles, the text classifiers and the associated vocabulary sets, so that the labeled classification label of the text to be corrected is corrected and the accuracy of the text classification label is ensured. The corrected texts can subsequently be used to train a corresponding text classification model, thereby improving the classification accuracy of the trained text classification model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating a method for correcting a text classification label according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a process for correcting a text classification label according to an embodiment of the present application;
FIG. 3 is a flow chart illustrating another method for correcting text classification labels according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a training process of a text classifier according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a process of constructing an association vocabulary set under each classification category according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of an apparatus for correcting text classification tags according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
When a text classification model is trained, the classification labels of the large number of texts acquired from crowdsourcing contain labeling errors, which lowers the classification accuracy of the text classification model. To address this problem, the embodiments of the application provide a method for correcting the classification labels of crowdsourced texts. The true classification of each crowdsourced text is analyzed from two different angles so that its classification label is corrected, the accuracy of the text classification labels is ensured, and the classification accuracy of a text classification model trained with these texts is further improved.
Fig. 1 is a flowchart illustrating a method for correcting a text classification tag according to an embodiment of the present application. Referring to fig. 1, the method may specifically include the following steps:
S110, calculating a first classification support rate of the text to be corrected under each set classification category based on the classification result of the text to be corrected in each trained text classifier.
When a text classification model for distinguishing the classification category to which each text belongs is trained, each text serving as a training sample is labeled with a corresponding classification label. To ensure the classification accuracy of the trained text classification model, these labeled classification labels must themselves be sufficiently accurate. Therefore, after a large number of texts labeled with corresponding classification labels are obtained from crowdsourcing, the labeled classification label of each text needs to be corrected.
For training of the text classification model, a corresponding text training library is constructed from the large number of texts obtained from crowdsourcing. Therefore, the text to be corrected in the application is any text in the constructed text training library, so that the classification label of each text in the text training library can be corrected, and the accuracy of the text classification labels is ensured.
At the moment, in order to accurately judge the real classification condition of each text, the method and the device can analyze the category of the text to be corrected from two different angles of model classification and text semantics so as to comprehensively analyze the actual classification type of the text to be corrected, and therefore accurately correct the labeled classification label of the text to be corrected. In this step, a detailed description will be given mainly of a manner of analyzing the classification result of the text to be corrected from the viewpoint of model classification.
If a single text classifier were trained on all the texts whose classification labels have not yet been corrected, its classification accuracy for various texts would be low. Therefore, a plurality of different training sample sets can be established, each containing a large number of texts, with the texts differing between sets. For example, sets of texts with different contents can be selected from the constructed text training library, each serving as one training sample set.
Then, each text in each training sample set is used to train a corresponding text classifier, so as to obtain a plurality of text classifiers with lower classification accuracy. At this time, because training sample sets adopted by the text classifiers are different, the classification performance of the text classifiers also has a certain difference, and therefore, classification results of the text classifiers when classifying the same text may also have a certain difference.
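As an illustration of drawing different training sample sets from one text training library, the following minimal sketch shuffles the library and deals it into batches. It is not the application's procedure: the name `extract_batches`, the fixed seed, and the choice to make the batches fully disjoint are assumptions (the text only requires that the sets differ), and the classifier training itself is left out.

```python
import random


def extract_batches(training_library, num_batches, seed=0):
    """Split a text training library into num_batches different sample sets.

    Assumption for this sketch: batches are disjoint, which is stronger than
    the requirement that batches merely differ from one another.
    """
    if not 1 <= num_batches <= len(training_library):
        raise ValueError("num_batches must be between 1 and the library size")
    shuffled = list(training_library)
    random.Random(seed).shuffle(shuffled)
    # Deal items round-robin so batch sizes differ by at most one.
    return [shuffled[i::num_batches] for i in range(num_batches)]


# Toy library of (text, labeled classification label) pairs.
library = [(f"text {i}", "sports" if i % 2 else "finance") for i in range(10)]
batches = extract_batches(library, num_batches=3)
```

Each element of `batches` would then be used to train one text classifier, so that the classifiers see different samples and may disagree on the same text.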
In addition, to ensure a certain degree of accuracy when the text classifiers classify texts, a plurality of different classification categories can be set in advance according to the categories to which the various texts may belong, and each of these classification categories is configured in every text classifier, so that each text classifier can determine, with a certain degree of accuracy, which of the classification categories a given text actually belongs to.
As an optional implementation scheme in the embodiment of the present application, when a labeled classification label of a certain text is corrected, the text to be corrected is first input into each trained text classifier, as shown in fig. 2, each text classifier performs corresponding analysis on a category to which the text to be corrected belongs, so as to obtain a classification result of each text classifier on the text to be corrected, where at this time, there may be a certain difference in the classification results when each text classifier classifies the same text to be corrected. Then, the classification result of each text classifier for the text to be corrected is comprehensively analyzed, the possibility that the text to be corrected belongs to each classification category is judged, and therefore the first classification support rate of the text to be corrected under each classification category is calculated.
It should be noted that, to ensure the overall accuracy of the first classification support rate, the present application requires the number of trained text classifiers to be greater than or equal to the number of set classification categories, so that the classification results of the text classifiers for the same text can cover every classification category into which the text might be classified.
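The first classification support rate can be pictured as a vote share across the classifiers. The sketch below assumes that each classifier outputs a single category name and that the rate is the fraction of classifiers voting for a category. The application only states that the rate is based on how often each category name occurs in the classification results, so the normalization and the name `first_support_rate` are assumptions.

```python
from collections import Counter


def first_support_rate(classifier_outputs, categories):
    """Fraction of classifiers that assigned the text to each category."""
    if not classifier_outputs:
        raise ValueError("at least one classifier output is required")
    counts = Counter(classifier_outputs)
    total = len(classifier_outputs)
    # Categories that no classifier chose get a support rate of 0.0.
    return {category: counts[category] / total for category in categories}


categories = ["sports", "finance", "health"]
# One predicted category name per trained text classifier (toy values).
outputs = ["sports", "sports", "finance", "sports"]
rates = first_support_rate(outputs, categories)
```

With four classifiers and three categories, the toy input also satisfies the condition that the number of classifiers is at least the number of categories.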
S120, predicting a second classification support rate of the text to be corrected under each classification category based on the association frequency, in the text to be corrected, of the associated vocabulary set constructed under each classification category.
In this step, the manner of analyzing the classification result of the text to be corrected from the viewpoint of text semantics will be mainly described in detail.
In particular, considering that the semantics of a text are generally determined by the words it contains, the present application can analyze the semantic category of the text to be corrected by determining how many of its words are associated with each classification category. For example, if most words in the text to be corrected are associated with the "sports" category, it can be determined that the text to be corrected belongs to the "sports" category.
As an optional implementation scheme in the embodiment of the present application, in order to accurately determine the number of words and phrases related to each classification category included in the text to be corrected, a corresponding relevant word set is constructed for each classification category in advance, and the relevant word set under each classification category includes a plurality of words and phrases having a certain association relationship with the classification category, for example, the relevant word set under the "sports" category may include words and phrases related to various sports, such as "sports", "football", "basketball", and "running".
In the present application, to determine how many words related to each classification category the text to be corrected contains, it is possible to analyze, for each classification category, whether each associated word in the associated vocabulary set under that classification category exists in the text to be corrected, as shown in fig. 2. Then, according to the number of associated words found in the text to be corrected, the number of words related to the classification category contained in the text to be corrected is determined, so as to analyze the possibility that the text to be corrected belongs to the classification category and predict the second classification support rate of the text to be corrected under the classification category. The above process is executed for each classification category, so that the second classification support rate of the text to be corrected under each classification category can be predicted.
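The counting described above can be sketched as follows. The sketch assumes the second classification support rate is the number of associated-word occurrences divided by the total number of words in the text, which matches the inputs named in the summary but is not a formula quoted from the application. Whitespace tokenization and the name `second_support_rate` are also assumptions; Chinese text would need a word segmenter instead.

```python
def second_support_rate(text_words, associated_vocab):
    """Share of the text's words found in each category's associated vocabulary set.

    text_words: list of words in the text to be corrected.
    associated_vocab: dict mapping category name -> set of associated words.
    """
    total = len(text_words)
    if total == 0:
        return {category: 0.0 for category in associated_vocab}
    return {
        # Count every occurrence, so a repeated associated word counts each time.
        category: sum(1 for word in text_words if word in vocab) / total
        for category, vocab in associated_vocab.items()
    }


associated_vocab = {
    "sports": {"sports", "football", "basketball", "running"},
    "finance": {"stock", "bank"},
}
words = "the football team kept running after the basketball game".split()
rates = second_support_rate(words, associated_vocab)
```

Here three of the nine words belong to the "sports" vocabulary set and none to "finance", so "sports" receives the larger rate.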
It should be understood that the first classification support rate and the second classification support rate of the text to be corrected under each classification category are respectively used for representing the possibility that the text to be corrected belongs to each classification category analyzed from two different angles, namely model classification and text semantics, and then the real category to which the text to be corrected belongs can be accurately determined by comprehensively analyzing the first classification support rate and the second classification support rate.
It should be noted that, in the present application, S110 and S120 have no required execution order; they may be executed simultaneously or one after the other, which is not limited herein.
S130, correcting the labeled classification label of the text to be corrected by using the first classification support rate and the second classification support rate of the text to be corrected in each classification category.
After the first classification support rate and the second classification support rate of the text to be corrected in each classification category are obtained, the classification category to which the text to be corrected most possibly belongs can be determined from the angle of model classification according to the size of the first classification support rate of the text to be corrected in each classification category. And according to the second classification support rate of the text to be corrected under each classification category, the classification category to which the text to be corrected most possibly belongs can be determined from the perspective of text semantics. Then, whether the two classification categories which are determined from the two angles of model classification and text semantics and to which the text to be corrected most possibly belongs are consistent or not is judged, so that whether the labeled classification label of the text to be corrected is wrong or not can be determined, the labeled classification label of the text to be corrected is accurately corrected, and the accuracy of the text classification label is ensured.
For example, in the present application, the step of correcting the labeled classification label of the text to be corrected may specifically include: respectively determining a first belonged category corresponding to the maximum value among the first classification support rates of the text to be corrected, and a second belonged category corresponding to the maximum value among the second classification support rates of the text to be corrected; and if the first belonged category is the same as the second belonged category, correcting the labeled classification label of the text to be corrected into the category name of that same belonged category.
That is, the maximum value among the first classification support rates of the text to be corrected under the classification categories can be found, and the first belonged category corresponding to this maximum value represents the classification category to which the text to be corrected most probably belongs, as determined from the angle of model classification. Likewise, the maximum value among the second classification support rates of the text to be corrected under the classification categories can be found, and the second belonged category corresponding to this maximum value represents the classification category to which the text to be corrected most probably belongs, as determined from the angle of text semantics. If the first belonged category and the second belonged category are the same, the text to be corrected has been determined to belong to the same category from two different angles, and that category can be taken as the true classification category of the text to be corrected. Therefore, when the labeled classification label of the text to be corrected is the same as the category name of that category, the label is accurate and needs no correction; when the labeled classification label of the text to be corrected differs from the category name of that category, the label is wrong and can be corrected to the category name of that category, so that the accuracy of the text classification label is ensured.
In addition, if the first belonged category and the second belonged category are different, the true classification category of the text to be corrected cannot be accurately determined from the two angles. In this case, the labeled classification label of the text to be corrected is taken as the reference, and no correction operation is performed on it.
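The decision rule described above (correct only when both angles agree, otherwise keep the labeled label) can be summarized in a few lines. The function name `correct_label` and the tie-breaking behavior are assumptions for this sketch: when several categories share the maximum rate, Python's `max` picks the first one in dictionary order, a case the text does not address.

```python
def correct_label(labeled_label, first_rates, second_rates):
    """Return the label to keep for the text after the correction step."""
    # Category with the largest support rate from each of the two angles.
    first_category = max(first_rates, key=first_rates.get)
    second_category = max(second_rates, key=second_rates.get)
    if first_category == second_category:
        # Both angles agree: use the agreed category name, which leaves an
        # already-correct label unchanged.
        return first_category
    # The angles disagree: keep the labeled classification label as-is.
    return labeled_label


first = {"sports": 0.75, "finance": 0.25, "health": 0.0}
second_agree = {"sports": 0.4, "finance": 0.1, "health": 0.0}
second_differ = {"sports": 0.1, "finance": 0.3, "health": 0.0}

corrected = correct_label("finance", first, second_agree)
unchanged = correct_label("health", first, second_differ)
```

In the first call both angles point to "sports", so the label "finance" is replaced; in the second call they disagree, so the label "health" is kept.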
It should be understood that each text in the constructed text training library can be used as the text to be corrected in the present application, and the above correction steps can be performed on it, so that the labeled classification label of each text in the text training library is corrected and the accuracy of each text classification label in the text training library is ensured. When a corresponding text classification model is subsequently trained with the text training library, the classification accuracy of the text classification model can be further improved.
The technical scheme provided by the embodiment of the application is that a plurality of text classifiers are trained in advance, and a corresponding associated vocabulary set is constructed for each set classification category in advance, so that when the classification label of a text to be corrected is corrected, the first classification support rate of the text to be corrected under each classification category is calculated according to the classification result of the text to be corrected in each text classifier, and the second classification support rate of the text to be corrected under each classification category is predicted according to the association frequency of the associated vocabulary under each classification category in the text to be corrected, so that the real classification condition of the text to be corrected is respectively analyzed from two angles of the text classifier and the associated vocabulary set, so that the labeled classification label of the text to be corrected is corrected, the accuracy of the text classification label is ensured, and a corresponding text classification model can be trained subsequently by adopting the corrected text, and further, the classification accuracy of the trained text classification model is improved.
As an optional implementation scheme in the embodiment of the present application, in order to ensure accurate correction of the text classification label, a specific determination process of a first classification support rate and a second classification support rate of a text to be corrected in each classification category is described in detail in the present application.
Fig. 3 is a flowchart illustrating another text classification label correction method according to an embodiment of the present application.
As shown in fig. 3, the method may specifically include the following steps:
S310, respectively inputting the text to be corrected into each trained text classifier to obtain a classification result matrix of the text to be corrected.
When the possibility that the text to be corrected actually belongs to each classification category is analyzed from the angle of model classification, the text to be corrected can be input into each trained text classifier, and each text classifier analyzes the category to which the text to be corrected belongs, so that the classification result output by each text classifier for the text to be corrected is obtained. The classification results output by the text classifiers are then combined to obtain the classification result matrix of the text to be corrected.
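For example, the classification result matrix of the text to be corrected may be represented as $Y = [y_1, \ldots, y_j, \ldots, y_n]$, where the j-th element $y_j$ is the category name output as the classification result by the j-th text classifier, and the classification result matrix has $n$ elements, $n$ being the number of text classifiers.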
It should be noted that, in the present application, for training each text classifier, the following steps may be performed: extracting corresponding test text sets in batches from the constructed text training library; and training corresponding text classifiers by respectively using the test text sets extracted in each batch to obtain the trained text classifiers.
That is to say, in order to obtain training samples of each text classifier, the present application may extract a corresponding number of texts from a pre-constructed text training library in batches, as shown in fig. 4, to form a test text set corresponding to each batch. For example, 20% of the text data may be extracted from the text training library each time and combined into a corresponding test text set, and the remaining 80% of the text data may be used as a verification text set of a text classifier trained by using the test text set, so as to verify the classification capability of the text classifier subsequently. Then, a plurality of test text sets can be obtained through extraction in batches, a corresponding text classifier can be trained by respectively using the extracted test text sets in each batch, and the trained text classifiers can be obtained through training after extraction in batches.
It should be understood that, in order to ensure the comprehensiveness of the classification of the text to be corrected by the multiple text classifiers, a certain difference may be required between the classification performances of the individual text classifiers, so that the corresponding test text sets are extracted from the text training library in batches, and texts with different contents are selected for extraction in each batch, so that the test text sets extracted in different batches are different from each other.
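The batch extraction described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `draw_batches`, the use of random sampling, the fixed seed and the reading of "different" as "not the same set of texts" are all choices made for the example, and the 20% fraction is the example value given in the text. Training the classifiers themselves is left out.

```python
import math
import random
from typing import List, Sequence, Tuple


def draw_batches(library: Sequence[str], n_batches: int,
                 fraction: float = 0.2, seed: int = 0
                 ) -> List[Tuple[List[str], List[str]]]:
    """Draw n_batches distinct subsets; return (extracted set, remaining set) pairs."""
    size = max(1, int(len(library) * fraction))
    # Guard against asking for more distinct subsets than exist.
    if n_batches > math.comb(len(library), size):
        raise ValueError("not enough distinct subsets for the requested batches")
    rng = random.Random(seed)
    seen = set()
    batches = []
    while len(batches) < n_batches:
        picked = frozenset(rng.sample(range(len(library)), size))
        if picked in seen:
            continue  # same texts as an earlier batch: draw again
        seen.add(picked)
        extracted = [library[i] for i in sorted(picked)]
        remaining = [library[i] for i in range(len(library)) if i not in picked]
        batches.append((extracted, remaining))
    return batches
```

Each extracted set would then train one text classifier, and the corresponding remaining set would be used to verify it.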
S320, calculating a first classification support rate of the text to be corrected under the classification category based on the occurrence frequency of the category name of each classification category in the classification result matrix.
After the classification result matrix of the text to be corrected is obtained, since the matrix contains the classification result output by each text classifier for the text to be corrected, the possibility that the text to be corrected belongs to each classification category can be analyzed as follows: for each classification category, the number of times the category name of the classification category appears in the classification result matrix (that is, the occurrence frequency in the present application) is counted, and the ratio of this number to the total number of classification results contained in the classification result matrix (that is, the number of trained text classifiers) is calculated as the first classification support rate of the text to be corrected under the classification category. This process is executed for each classification category, so that the first classification support rate of the text to be corrected under each classification category can be calculated.
For example, the calculation formula of the first classification support rate of the text to be corrected under each classification category in the present application may be:
Figure BDA0003381675990000101
where $y_j$ is the category name of the j-th classification category, $\mathrm{count}(\cdot)$ is the counting function, $\mathrm{count}(y_j)$ is the number of times the category name of the j-th classification category appears in the classification result matrix as a classification result output by a text classifier, and $n$ is the number of text classifiers.
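Steps S310 and S320 can be sketched together as follows, computing count(y_j) / n for every category. The sketch assumes each trained text classifier can be called as a function that takes a text and returns one category name; that interface and the name `first_support_rates` are hypothetical and not taken from the application.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def first_support_rates(text: str,
                        classifiers: Sequence[Callable[[str], str]],
                        categories: Sequence[str]) -> Dict[str, float]:
    """First classification support rate of `text` under each category."""
    if not classifiers:
        raise ValueError("at least one trained text classifier is required")
    # Classification result matrix: one predicted category name per classifier.
    results: List[str] = [clf(text) for clf in classifiers]
    counts = Counter(results)
    n = len(results)
    # Occurrence count of each category name divided by the number of classifiers.
    return {cat: counts[cat] / n for cat in categories}
```

Categories that no classifier predicts receive a support rate of 0, because `Counter` returns 0 for missing keys.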
S330, counting the occurrence frequency of the associated vocabulary in the associated vocabulary set under each classification category in the text to be corrected.
When the possibility that the text to be corrected actually belongs to the category is analyzed from the perspective of text semantics, for each category, each vocabulary in the text to be corrected is checked at first, whether the vocabulary is the associated vocabulary in the associated vocabulary set under the category is judged, so that the frequency of each associated vocabulary in the associated vocabulary set under the category appearing in the text to be corrected is analyzed, and the frequency of the associated vocabulary in the associated vocabulary set under the category appearing in the text to be corrected can be obtained through statistics. The above process is respectively executed for each classification category, so that the occurrence frequency of the associated vocabulary in the associated vocabulary set under each classification category in the text to be corrected can be counted, and the possibility that the text to be corrected belongs to each classification category can be analyzed subsequently.
S340, calculating a second classification support rate of the text to be corrected under each classification type according to the occurrence times under each classification type and the total number of words in the text to be corrected.
Optionally, after obtaining the occurrence number of the associated vocabulary in the associated vocabulary set in each classification category in the text to be corrected, calculating a ratio of the occurrence number in each classification category to the total number of vocabularies in the text to be corrected, where the ratio may indicate a possibility that the text to be corrected belongs to the classification category, that is, a second classification support rate of the text to be corrected in the classification category. And executing the same steps for each classification category to calculate the second classification support rate of the text to be corrected under each classification category.
For example, assume that the associated vocabulary set under a certain classification category contains three associated words A, B and C, and that the text to be corrected is ACDEHOBACBKDA, where each letter stands for one word. The total number of words in the text to be corrected is then 13, and the associated words A, B and C appear in the text to be corrected in the sequence A, C, B, A, C, B, A. The number of occurrences of the associated words of this classification category in the text to be corrected is therefore 7, and the second classification support rate of the text to be corrected under this classification category is 7/13.
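Steps S330 and S340 can be sketched as follows. The sketch assumes the text to be corrected has already been split into words; the segmentation step, the function name `second_support_rates` and the choice to return 0 for an empty text are assumptions made for the example.

```python
from typing import Dict, Sequence, Set


def second_support_rates(words: Sequence[str],
                         vocab_by_category: Dict[str, Set[str]]
                         ) -> Dict[str, float]:
    """Second classification support rate of a tokenized text under each category."""
    total = len(words)
    if total == 0:
        # Assumption: an empty text supports no category.
        return {cat: 0.0 for cat in vocab_by_category}
    return {
        # Occurrences of the category's associated words over the total word count.
        cat: sum(1 for w in words if w in vocab) / total
        for cat, vocab in vocab_by_category.items()
    }
```

Applied to the thirteen-word example above, with each letter treated as one word and the associated vocabulary set {A, B, C}, the function yields 7/13 for that category.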
In order to ensure the comprehensiveness of the associated vocabulary set under each classification category, before predicting the second classification support rate of the text to be corrected under each classification category, the present application analyzes the replaceable vocabulary of the category name of each classification category to construct the associated vocabulary set under the classification category. That is, words whose meanings are similar to the category name of each classification category are identified in a large number of texts and taken as the replaceable vocabulary of that category name, and the replaceable words are then combined to construct the associated vocabulary set under the classification category.
Illustratively, when the replaceable vocabulary of the category name of each classification category is analyzed, aiming at each classification category, a target text under the classification category is searched from a constructed text training library, and the target text contains the category name of the classification category; hiding the category name of the classification category in each target text, inputting each hidden target text into the constructed semantic replacement model, and outputting the replaceable vocabulary of the category name of the classification category.
Specifically, considering that texts with different semantics under the same classification category use different vocabulary expressions, a semantic replacement model is constructed in advance so that these different expressions can be analyzed accurately. The semantic replacement model analyzes the semantics of each input text and then outputs, according to the semantic information of the text, the words suitable for filling the vacant position in the text. The semantic replacement model in the present application may be a Bidirectional Encoder Representations from Transformers (BERT) model, whose Masked Language Model (MLM) objective is used to understand the semantics at the vacant position in each input text and to predict the words that can fill that position.
In this case, in order to analyze replacements for the category name of each classification category comprehensively, it is first necessary, for each classification category, to search the constructed text training library for every target text that contains the category name of the classification category. Then, as shown in fig. 5, the category name is hidden in each target text found under the classification category, so that the target text is left vacant at the position of the category name. Next, for each classification category, each target text with the category name hidden can be input into the constructed semantic replacement model, which performs semantic analysis on the vacant position and outputs, for each target text, the words suitable for filling the vacant position; these output words serve as replaceable words of the category name in that target text. That is, for each target text under a classification category, the semantic replacement model outputs a plurality of replaceable words of the category name of the classification category. Therefore, for each target text under each classification category, the TOP50 algorithm is used to obtain the replaceable words in the target text that are most similar to the category name of the classification category.
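The search, hiding and prediction steps above can be sketched as follows. The semantic replacement model is represented by a caller-supplied function `fill_mask` that takes a masked text and returns candidate words ranked from most to least suitable; this interface, the `[MASK]` placeholder, the choice to hide only the first occurrence of the category name, and the reading of "TOP50" as "keep the 50 highest-ranked candidates" are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

MASK = "[MASK]"  # assumed placeholder for the vacant position


def replaceable_words(category_name: str,
                      library: Sequence[str],
                      fill_mask: Callable[[str], Sequence[str]],
                      top_k: int = 50) -> List[List[str]]:
    """Return, for each target text, the top_k candidate replacements."""
    per_text: List[List[str]] = []
    for text in library:
        if category_name not in text:
            continue  # not a target text for this category
        # Hide the category name so that its position is left vacant.
        masked = text.replace(category_name, MASK, 1)
        # Keep only the highest-ranked candidates for this target text.
        per_text.append(list(fill_mask(masked))[:top_k])
    return per_text
```

In practice `fill_mask` could wrap a masked-language-model predictor, but any function with the same shape works for experimenting with the rest of the procedure.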
Then, all the replaceable words obtained by the TOP50 algorithm for all the target texts under the classification category are merged, and, according to the word frequency of each replaceable word, in combination with stop words and the like, the TOP100 algorithm is further used to obtain a corresponding number of replaceable words from the merged replaceable words to construct the associated vocabulary set under the classification category. The above process is executed for each classification category to construct the associated vocabulary set under each classification category.
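The merging step can be sketched as follows. The text does not spell out how word frequency and stop words are combined, so this sketch makes a simple assumption: stop words are discarded, the remaining candidates are ranked by how often they occur across all target texts, and "TOP100" is read as keeping the 100 most frequent. The function name `build_associated_vocabulary` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set


def build_associated_vocabulary(per_text_candidates: Iterable[Sequence[str]],
                                stop_words: Set[str],
                                top_n: int = 100) -> List[str]:
    """Merge per-text candidates and keep the top_n most frequent non-stop words."""
    freq = Counter(
        word
        for candidates in per_text_candidates
        for word in candidates
        if word not in stop_words  # drop stop words before ranking
    )
    # most_common orders by descending frequency.
    return [word for word, _ in freq.most_common(top_n)]
```

The returned list for a category can be converted to a set and used as that category's associated vocabulary set when counting occurrences in the text to be corrected.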
S350, correcting the labeled classification label of the text to be corrected by utilizing the first classification support rate and the second classification support rate of the text to be corrected in each classification category.
According to the technical scheme provided by the embodiment of the application, the first classification support rate of the text to be corrected under each classification category is calculated according to the classification result of the text to be corrected in each text classifier, the second classification support rate of the text to be corrected under each classification category is predicted according to the association frequency of the association words under each classification category in the text to be corrected, so that the real classification conditions of the text to be corrected are respectively analyzed from two angles of the text classifier and the association words, the labeled classification labels of the text to be corrected are corrected, the accuracy of the text classification labels is ensured, the corrected text can be used for training a corresponding text classification model subsequently, and the classification accuracy of the trained text classification model is further improved.
Fig. 6 is a schematic block diagram of a device for correcting a text classification tag according to an embodiment of the present application.
As shown in fig. 6, the apparatus 600 may include:
a text classification calculating module 610, configured to calculate, based on a classification result of a text to be corrected in each trained text classifier, a first classification support rate of the text to be corrected under each set classification category;
the text classification prediction module 620 is configured to predict a second classification support rate of the text to be corrected in each classification category based on the association frequency of the built associated words in the text to be corrected in each classification category;
the classification label correction module 630 is configured to correct the labeled classification label of the text to be corrected by using the first classification support rate and the second classification support rate of the text to be corrected in each classification category.
Further, the text classification calculating module 610 may be specifically configured to:
respectively inputting the text to be corrected into each trained text classifier to obtain a classification result matrix of the text to be corrected;
and calculating the first classification support rate of the text to be corrected under each classification category based on the occurrence frequency of the category name of each classification category in the classification result matrix.
Further, the apparatus 600 for correcting the text classification label may further include:
the text extraction module is used for extracting corresponding test text sets in batches from the constructed text training library;
and the classifier training module is used for training the corresponding text classifier by respectively utilizing the test text set extracted in each batch so as to obtain each trained text classifier.
Furthermore, the test text sets extracted in different batches are different.
Further, the text classification prediction module 620 may be specifically configured to:
counting the occurrence times of the associated vocabulary in the associated vocabulary set under each classification category in the text to be corrected;
and calculating a second classification support rate of the text to be corrected under each classification category according to the occurrence times under each classification category and the total number of words in the text to be corrected.
Further, the apparatus 600 for correcting the text classification label may further include:
and the vocabulary set building module is used for analyzing the replaceable vocabulary of the category name of each classification category so as to build an associated vocabulary set under the classification category.
Further, the vocabulary set constructing module may be specifically configured to:
aiming at each classification category, searching out a target text under the classification category from a constructed text training library, wherein the target text contains a category name of the classification category;
hiding the category name of the classification category in each target text, inputting each hidden target text into a constructed semantic replacement model, and outputting a replaceable vocabulary of the category name of the classification category.
Further, the semantic replacement model is a Bidirectional Encoder Representations from Transformers (BERT) model.
Further, the classification label correction module 630 may be specifically configured to:
respectively determining a first belonged category of the text to be corrected, corresponding to the maximum value among the first classification support rates, and a second belonged category of the text to be corrected, corresponding to the maximum value among the second classification support rates;
and if the first belonged category is the same as the second belonged category, correcting the labeled classification label of the text to be corrected into the category name of the same belonged category.
Furthermore, the number of the trained text classifiers is greater than or equal to the number of the set classification classes, and the text to be corrected is any text in the constructed text training library.
In the embodiment of the application, a plurality of text classifiers are trained in advance, a corresponding associated vocabulary set is constructed for each set classification category in advance, so that when the classification label of the text to be corrected is corrected, the first classification support rate of the text to be corrected under each classification category is calculated according to the classification result of the text to be corrected in each text classifier, the second classification support rate of the text to be corrected under each classification category is predicted according to the association frequency of the associated vocabulary under each classification category in the text to be corrected, the real classification condition of the text to be corrected is respectively analyzed from two angles of the text classifier and the associated vocabulary set, the labeled classification label of the text to be corrected is corrected, the accuracy of the text classification label is ensured, and a corresponding text classification model can be trained by adopting the corrected text subsequently, and further, the classification accuracy of the trained text classification model is improved.
It is to be understood that apparatus embodiments and method embodiments may correspond to one another and that similar descriptions may refer to method embodiments. To avoid repetition, further description is omitted here. Specifically, the apparatus 600 shown in fig. 6 may perform any method embodiment provided in the present application, and the foregoing and other operations and/or functions of each module in the apparatus 600 are respectively for implementing corresponding processes in each method of the embodiment of the present application, and are not described herein again for brevity.
The apparatus 600 of the embodiments of the present application is described above in connection with the figures from the perspective of functional modules. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or instructions in the form of software, and the steps of the method disclosed in conjunction with the embodiments in the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, and the like, as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps in the above method embodiments in combination with hardware thereof.
Fig. 7 is a schematic block diagram of an electronic device 700 provided in an embodiment of the present application.
As shown in fig. 7, the electronic device 700 may include:
a memory 710 and a processor 720, the memory 710 for storing a computer program and transferring the program code to the processor 720. In other words, the processor 720 may call and run a computer program from the memory 710 to implement the method for correcting the text classification label in the embodiment of the present application.
For example, the processor 720 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In some embodiments of the present application, the processor 720 may include, but is not limited to:
general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the memory 710 includes, but is not limited to:
volatile memory and/or non-volatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules, which are stored in the memory 710 and executed by the processor 720 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing certain functions, the instruction segments describing the execution of the computer program in the electronic device.
As shown in fig. 7, the electronic device may further include:
a transceiver 730, the transceiver 730 being connectable to the processor 720 or the memory 710.
The processor 720 may control the transceiver 730 to communicate with other devices, and specifically, may transmit information or data to the other devices or receive information or data transmitted by the other devices. The transceiver 730 may include a transmitter and a receiver. The transceiver 730 may further include an antenna, and the number of antennas may be one or more.
It should be understood that the various components in the electronic device are connected by a bus system that includes a power bus, a control bus, and a status signal bus in addition to a data bus.
Embodiments of the present application also provide a computer storage medium having a computer program stored thereon, where the computer program, when executed by a computer, enables the computer to execute the method of the above method embodiments. In other words, the present application also provides a computer program product containing instructions, which when executed by a computer, cause the computer to execute the method of the above method embodiments.
When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the present application occur, in whole or in part, when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Video Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the module is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A method for correcting a text classification label, comprising:
calculating a first classification support rate of the text to be corrected under each set classification category based on the classification result of the text to be corrected in each trained text classifier;
predicting a second classification support rate of the text to be corrected under each classification category based on the association frequency of the constructed associated words in the text to be corrected under each classification category;
and correcting the labeled classification label of the text to be corrected by utilizing the first classification support rate and the second classification support rate of the text to be corrected under each classification category.
2. The method according to claim 1, wherein the calculating a first classification support rate of the text to be corrected under each set classification category based on the classification result of the text to be corrected in each trained text classifier comprises:
respectively inputting the text to be corrected into each trained text classifier to obtain a classification result matrix of the text to be corrected;
and calculating the first classification support rate of the text to be corrected under each classification category based on the occurrence frequency of the category name of each classification category in the classification result matrix.
3. The method according to claim 1, further comprising, before calculating a first classification support rate of the text to be corrected under each set classification category based on the classification result of the text to be corrected in each trained text classifier:
extracting corresponding test text sets in batches from the constructed text training library;
respectively training corresponding text classifiers by using the test text sets extracted in each batch to obtain the trained text classifiers;
wherein the test text sets extracted from different batches are different.
4. The method according to claim 1, wherein the predicting a second classification support rate of the text to be corrected in each classification category based on the association frequency of the constructed associated words in the text to be corrected in each classification category comprises:
counting the occurrence times of the associated vocabulary in the associated vocabulary set under each classification category in the text to be corrected;
and calculating a second classification support rate of the text to be corrected under each classification category according to the occurrence times under each classification category and the total number of words in the text to be corrected.
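The calculation in claim 4 can be illustrated with a short sketch. The names `second_support_rates` and `associated_words`, the example vocabulary, and the whitespace tokenization are assumptions for illustration; Chinese text would need a word segmenter, and the claim does not fix the exact formula beyond using the occurrence count and the total number of words.

```python
def second_support_rates(tokens, associated_words):
    """Occurrences of each category's associated words divided by total words."""
    total = len(tokens)
    if total == 0:
        return {category: 0.0 for category in associated_words}
    return {
        category: sum(1 for tok in tokens if tok in words) / total
        for category, words in associated_words.items()
    }

# Hypothetical associated vocabulary sets, one per classification category.
associated_words = {
    "cardiology": {"chest", "heartbeat", "cardiac"},
    "neurology": {"seizure", "headache"},
}
tokens = "patient reports chest pain and irregular heartbeat".split()
rates = second_support_rates(tokens, associated_words)
```

Here two of the seven words belong to the cardiology set and none to the neurology set.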
5. The method according to claim 1, further comprising, before predicting a second classification support rate of the text to be corrected under each classification category based on the association frequency of the constructed associated words under each classification category in the text to be corrected:
analyzing the replaceable vocabulary of the category name of each classification category to construct the associated vocabulary set under that classification category.
6. The method of claim 5, wherein analyzing the alternative vocabulary of category names for each classification category comprises:
aiming at each classification category, searching out a target text under the classification category from a constructed text training library, wherein the target text contains a category name of the classification category;
hiding the category name of the classification category in each target text, inputting each hidden target text into a constructed semantic replacement model, and outputting a replaceable vocabulary of the category name of the classification category.
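The retrieve, hide, and replace flow of claim 6 can be sketched as below. The semantic replacement model is represented by a hypothetical callable `replacement_model` that takes a masked text and returns candidate words; the `[MASK]` placeholder, the `top_k` cut-off, and ranking candidates by how often they are proposed are all assumptions that the claim does not specify.

```python
from collections import Counter

MASK = "[MASK]"  # assumed placeholder; the claim only says the name is hidden

def replaceable_vocabulary(library, category_name, replacement_model, top_k=5):
    """Collect words the model proposes in place of the hidden category name."""
    # Target texts: texts in the training library that contain the category name.
    targets = [text for text in library if category_name in text]
    counts = Counter()
    for text in targets:
        masked = text.replace(category_name, MASK)
        counts.update(replacement_model(masked))
    # Assumption: keep the most frequently proposed candidates.
    return [word for word, _ in counts.most_common(top_k)]

def stub_model(masked_text):
    """Hypothetical stand-in for a semantic replacement model."""
    return ["cardiac", "heart"] if MASK in masked_text else []

library = ["cardiology ward admission", "neurology consult", "referred to cardiology"]
vocab = replaceable_vocabulary(library, "cardiology", stub_model)
```

In practice the callable could wrap a masked language model, but any model that proposes substitutes for the hidden span fits this interface.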
7. The method according to claim 1, wherein the correcting the labeled classification label of the text to be corrected by using the first classification support rate and the second classification support rate of the text to be corrected in each classification category comprises:
respectively determining a first belonging category, for which the first classification support rate of the text to be corrected is the maximum, and a second belonging category, for which the second classification support rate of the text to be corrected is the maximum;
and if the first belonging category is the same as the second belonging category, correcting the labeled classification label of the text to be corrected to the category name of that shared category.
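The decision rule of claim 7 reduces to comparing two arg-max categories, as in this sketch. The function name `correct_label` and the example rates are hypothetical, and keeping the original label when the two categories disagree is an assumption, since this claim only covers the case where they agree.

```python
def correct_label(labeled, first_rates, second_rates):
    """Return the corrected label when both support rates agree on a category."""
    first_category = max(first_rates, key=first_rates.get)
    second_category = max(second_rates, key=second_rates.get)
    if first_category == second_category:
        return first_category
    # Assumption: leave the labeled classification label unchanged otherwise.
    return labeled

first_rates = {"cardiology": 0.67, "neurology": 0.33}
agreeing = {"cardiology": 0.29, "neurology": 0.0}
disagreeing = {"cardiology": 0.05, "neurology": 0.20}

corrected = correct_label("neurology", first_rates, agreeing)
unchanged = correct_label("oncology", first_rates, disagreeing)
```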
8. The method according to any one of claims 1 to 7, wherein the number of trained text classifiers is greater than or equal to the number of set classification classes, and the text to be corrected is any text in a constructed text training library.
9. An apparatus for correcting a text classification label, comprising:
the text classification calculation module is used for calculating a first classification support rate of the text to be corrected under each set classification category based on the classification result of the text to be corrected in each trained text classifier;
the text classification prediction module is used for predicting a second classification support rate of the text to be corrected under each classification category based on the association frequency, in the text to be corrected, of the constructed associated vocabulary set under each classification category;
and the classification label correction module is used for correcting the labeled classification labels of the text to be corrected by utilizing the first classification support rate and the second classification support rate of the text to be corrected in each classification category.
10. An electronic device, comprising:
a processor and a memory for storing a computer program, the processor being configured to invoke and execute the computer program stored in the memory to perform the method for correcting a text classification label according to any one of claims 1-8.
11. A computer-readable storage medium for storing a computer program which causes a computer to execute the method of correcting a text classification label according to any one of claims 1-8.
12. A computer program product comprising computer programs/instructions, characterized in that the computer programs/instructions, when executed by a processor, implement the method for correcting a text classification label according to any one of claims 1-8.
CN202111435681.0A 2021-11-29 2021-11-29 Method, device and equipment for correcting text classification label and storage medium Pending CN113987136A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111435681.0A CN113987136A (en) 2021-11-29 2021-11-29 Method, device and equipment for correcting text classification label and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111435681.0A CN113987136A (en) 2021-11-29 2021-11-29 Method, device and equipment for correcting text classification label and storage medium

Publications (1)

Publication Number Publication Date
CN113987136A true CN113987136A (en) 2022-01-28

Family

ID=79732561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111435681.0A Pending CN113987136A (en) 2021-11-29 2021-11-29 Method, device and equipment for correcting text classification label and storage medium

Country Status (1)

Country Link
CN (1) CN113987136A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023151284A1 (en) * 2022-02-14 2023-08-17 苏州浪潮智能科技有限公司 Classification result correction method and system, device, and medium


Similar Documents

Publication Publication Date Title
CN107992596B (en) Text clustering method, text clustering device, server and storage medium
CN107908635B (en) Method and device for establishing text classification model and text classification
US11645554B2 (en) Method and apparatus for recognizing a low-quality article based on artificial intelligence, device and medium
CN110020422B (en) Feature word determining method and device and server
CN107423278B (en) Evaluation element identification method, device and system
CN108549656B (en) Statement analysis method and device, computer equipment and readable medium
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN109902285B (en) Corpus classification method, corpus classification device, computer equipment and storage medium
CN110427487B (en) Data labeling method and device and storage medium
CN110175851B (en) Cheating behavior detection method and device
CN110309511B (en) Shared representation-based multitask language analysis system and method
CN111259144A (en) Multi-model fusion text matching method, device, equipment and storage medium
CN109905385A (en) A kind of webshell detection method, apparatus and system
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
WO2022048194A1 (en) Method, apparatus and device for optimizing event subject identification model, and readable storage medium
US11182605B2 (en) Search device, search method, search program, and recording medium
CN111767738A (en) Label checking method, device, equipment and storage medium
CN114090794A (en) Event map construction method based on artificial intelligence and related equipment
CN113011191A (en) Knowledge joint extraction model training method
Aralikatte et al. Fault in your stars: an analysis of android app reviews
CN116049412A (en) Text classification method, model training method, device and electronic equipment
CN110275953B (en) Personality classification method and apparatus
CN110968664A (en) Document retrieval method, device, equipment and medium
US11966455B2 (en) Text partitioning method, text classifying method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination