CN112989051B - Text classification method, device, equipment and computer readable storage medium

Text classification method, device, equipment and computer readable storage medium

Info

Publication number
CN112989051B
Authority
CN
China
Prior art keywords: text, category, word, classification, sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110392536.2A
Other languages
Chinese (zh)
Other versions
CN112989051A (en)
Inventor
郭良越
丁文彪
刘琼琼
刘子韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110392536.2A priority Critical patent/CN112989051B/en
Publication of CN112989051A publication Critical patent/CN112989051A/en
Application granted granted Critical
Publication of CN112989051B publication Critical patent/CN112989051B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00: Commerce
    • G06Q 30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0241: Advertisements
    • G06Q 30/0242: Determining effectiveness of advertisements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Databases & Information Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a text classification method, apparatus, device, and computer-readable storage medium. The method includes: inputting a target text into a text classification model to obtain prediction frequency values corresponding to the target text and each preset category; and, based on those prediction frequency values, determining the preset category with the largest prediction frequency value as the target category of the target text. The text classification model used for classification is trained according to labeling probability values corresponding to text samples and each preset category, where the labeling probability value of a text sample for a preset category is determined from the number of labeling results corresponding to that category and the total number of labeling results of the sample. Because the method is based on a model trained with such labels, it classifies the target text with higher accuracy.

Description

Text classification method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text classification method, apparatus, device, and computer-readable storage medium.
Background
Many scenarios involve text classification, in which text is classified and labeled according to a classification system or standard. For example, in a text review scenario, text content must be reviewed for inappropriateness in order to filter out sensitive, vulgar, advertising, and other such content. The text review process is in fact a text classification process; for example, the text categories may be defined as category 1, category 2, category 3, and so on.
At present, text classification is mostly performed with a text classification model. The labels of the model's training samples are usually assigned by a single annotator, and the model is then trained on the training samples and their labels.
However, labeling labels in this way limits the accuracy of text classification performed with the model.
Disclosure of Invention
To solve the above technical problem or to at least partially solve the above technical problem, the present disclosure provides a method, an apparatus, a device, and a computer-readable storage medium for text classification.
In a first aspect, the present disclosure provides a method for text classification, including:
inputting a target text into a text classification model to obtain prediction frequency values corresponding to the target text and each preset category, where the text classification model is trained according to labeling probability values corresponding to a text sample and each preset category, and the labeling probability value of the text sample for each preset category is determined according to the number of labeling results corresponding to that preset category and the total number of labeling results of the text sample;
and determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
Optionally, before the step of inputting the target text into the text classification model and obtaining the prediction frequency values corresponding to the target text and all the preset categories, the method further includes:
obtaining labeling results of the text sample, where each labeling result corresponds to one preset category;
for each preset category, determining a labeling probability value of the text sample corresponding to the preset category according to the number of labeling results corresponding to the preset category and the total number of the labeling results of the text sample;
and training the text classification model according to the labeling probability values respectively corresponding to the text samples and all the preset classes.
Optionally, the determining, according to the number of the labeling results corresponding to the preset category and the total number of the labeling results of the text sample, a labeling probability value of the text sample corresponding to the preset category includes:
and determining the ratio of the number of labeling results corresponding to the preset category to the total number of labeling results of the text sample as the labeling probability value corresponding to the text sample and the preset category.
Optionally, the determining, according to the number of the labeling results corresponding to the preset category and the total number of the labeling results of the text sample, a labeling probability value of the text sample corresponding to the preset category includes:
acquiring the ratio of the number of labeling results corresponding to the preset category to the total number of labeling results of the text sample;
and obtaining the product of the ratio and the weight value corresponding to the ratio, and determining the product as the labeling probability value corresponding to the text sample and the preset category.
Optionally, the training of the text classification model according to the label probability values respectively corresponding to the text sample and all the preset categories includes:
inputting the text sample into the text classification model to obtain prediction frequency values respectively corresponding to the text sample and all the preset categories;
for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the label probability value corresponding to the preset category;
and adjusting parameters of the text classification model according to the classification loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model converges.
Optionally, after the text sample is input into the text classification model and prediction frequency values respectively corresponding to the text sample and all the preset categories are obtained, the method further includes:
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
correspondingly, the adjusting the parameters of the text classification model according to the classification loss of the text sample includes:
and adjusting parameters of the text classification model according to the classification loss of the text sample and the similarity loss of the text sample.
Optionally, the inputting the text sample into the text classification model includes:
inputting a text sample into the text classification model to obtain a target category of the text sample;
determining a similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, including:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, and determining the similarity loss corresponding to the multi-category word according to that similarity;
if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, and determining a difference value based on that similarity as the similarity loss corresponding to the multi-category word.
In a second aspect, the present disclosure provides a method of text classification, comprising:
inputting a target text into a text classification model to obtain prediction frequency values corresponding to the target text and each preset category, where the text classification model is trained according to a similarity loss determined from the similarity between the word vectors of the multi-category words contained in a text sample and the reference word vectors of those multi-category words, the reference word vector of a multi-category word being its word vector in a reference text;
and determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
Optionally, before the step of inputting the target text into the text classification model and obtaining the prediction frequency values corresponding to the target text and all the preset categories, the method further includes:
inputting a text sample into the text classification model;
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
and adjusting parameters of the text classification model according to the similarity loss of the text samples, and returning to execute the step of inputting the text samples into the text classification model until the text classification model converges.
Optionally, before determining that the text sample contains the multi-category word, the method further includes:
inputting a text sample into the text classification model to obtain a target category of the text sample;
determining a similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, including:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, and determining the similarity loss corresponding to the multi-category word according to that similarity;
if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, and determining a difference value based on that similarity as the similarity loss corresponding to the multi-category word.
In a third aspect, the present disclosure provides an apparatus for text classification, including:
the obtaining module is used for inputting the target text into the text classification model to obtain prediction frequency values corresponding to the target text and each preset category; the text classification model is trained according to labeling probability values corresponding to a text sample and each preset category, and the labeling probability value of the text sample for each preset category is determined according to the number of labeling results corresponding to that preset category and the total number of labeling results of the text sample;
and the determining module is used for determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring labeling results of the text sample, where each labeling result corresponds to one preset category;
the determination module is further configured to: for each preset category, determining a labeling probability value of the text sample corresponding to the preset category according to the number of labeling results corresponding to the preset category and the total number of the labeling results of the text sample;
and the training module is used for training the text classification model according to the labeling probability values respectively corresponding to the text samples and all the preset categories.
Optionally, the determining module is specifically configured to:
and determining the ratio of the number of labeling results corresponding to the preset category to the total number of labeling results of the text sample as the labeling probability value corresponding to the text sample and the preset category.
Optionally, the determining module is specifically configured to:
acquiring the ratio of the number of labeling results corresponding to the preset category to the total number of labeling results of the text sample;
and obtaining the product of the ratio and the weight value corresponding to the ratio, and determining the product as the labeling probability value corresponding to the text sample and the preset category.
Optionally, the training module is specifically configured to:
inputting the text sample into the text classification model to obtain prediction frequency values respectively corresponding to the text sample and all the preset categories;
for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the label probability value corresponding to the preset category;
and adjusting parameters of the text classification model according to the classification loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model converges.
Optionally, the determining module is further configured to:
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
correspondingly, the training module is specifically configured to:
and adjusting parameters of the text classification model according to the classification loss of the text sample and the similarity loss of the text sample.
Optionally, the training module is specifically configured to:
inputting a text sample into the text classification model to obtain a target category of the text sample;
the determining module is specifically configured to:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, and determining the similarity loss corresponding to the multi-category word according to that similarity;
if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, and determining a difference value based on that similarity as the similarity loss corresponding to the multi-category word.
In a fourth aspect, the present disclosure provides an apparatus for text classification, comprising:
the obtaining module is used for inputting the target text into the text classification model to obtain prediction frequency values corresponding to the target text and each preset category; the text classification model is trained according to a similarity loss determined from the similarity between the word vectors of the multi-category words contained in a text sample and the reference word vectors of those multi-category words, the reference word vector of a multi-category word being its word vector in a reference text;
and the determining module is used for determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
Optionally, the apparatus further comprises:
the input module is used for inputting the text sample into the text classification model;
the determination module is further to: determining that the text sample contains multi-category words; for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text; determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
and the adjusting module is used for adjusting the parameters of the text classification model according to the similarity loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model is converged.
Optionally, the apparatus further comprises:
the obtaining module is used for inputting the text sample into the text classification model to obtain the target category of the text sample;
the determining module is specifically configured to:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, and determining the similarity loss corresponding to the multi-category word according to that similarity;
if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, and determining a difference value based on that similarity as the similarity loss corresponding to the multi-category word.
In a fifth aspect, the present disclosure provides an apparatus for text classification, comprising:
a memory for storing processor-executable instructions;
a processor, configured to implement the method of the first aspect described above when executing the instructions.
In a sixth aspect, the present disclosure provides an apparatus for text classification, comprising:
a memory for storing processor-executable instructions;
a processor, configured to implement the method of the second aspect described above when executing the instructions.
In a seventh aspect, the present disclosure provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the method for text classification as described in the first aspect above when executed by a processor.
In an eighth aspect, the present disclosure provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the method of text classification as described in the second aspect above when executed by a processor.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
and inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories. And determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories. The text classification model for text classification is obtained by training according to labeling probability values respectively corresponding to the text samples and all preset classes, and the labeling probability value corresponding to each preset class of the text samples is determined according to the number of the labeling results corresponding to the preset classes and the total number of the labeling results of the text samples. The labeling probability value considers the number of the plurality of labeling results and the total number of the labeling results, so that the labeling label is more objective, the accuracy of a text classification model trained based on the labeled label is higher, and the accuracy of text classification of the target text is higher based on the text classification model.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic diagram of a text classification model provided by the present disclosure;
Fig. 2 is a schematic flowchart of a text classification method according to an embodiment of the present disclosure;
Fig. 3 is a schematic flowchart of a training method for a text classification model according to an embodiment of the present disclosure;
Fig. 4 is a schematic flowchart of another training method for a text classification model according to an embodiment of the present disclosure;
Fig. 5 is a schematic flowchart of a training method for a text classification model according to another embodiment of the present disclosure;
Fig. 6 is a schematic flowchart of another text classification method according to an embodiment of the present disclosure;
Fig. 7 is a schematic flowchart of a training method for a text classification model according to another embodiment of the present disclosure;
Fig. 8 is a schematic flowchart of a training method for a text classification model according to yet another embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present disclosure;
Fig. 10 is a schematic structural diagram of another text classification apparatus according to an embodiment of the present disclosure;
Fig. 11 is a schematic structural diagram of a text classification device according to an embodiment of the present disclosure;
Fig. 12 is a schematic structural diagram of another text classification device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
In many scenarios, text classification is involved, i.e., text is classified and labeled according to a certain classification system or standard.
Text review is in fact a process of text classification. A specific application scenario of the present disclosure is described below using text review as an example: text review examines whether the content of a target text is inappropriate, so as to filter out sensitive, vulgar, advertising, and similar content. A plurality of categories, that is, preset categories, may be defined for target texts; for example, the preset categories may include, but are not limited to, a normal category and a violation category. For any target text, its target category can be determined by reviewing its content. The traditional manual review approach is inefficient and struggles to support ever-growing business demands.
At present, machine-learning text classification models are mostly used for text classification. Training such a model is supervised learning: the model must be trained with text samples and their labels. The labels used during training mark the true categories of the text samples, so that the model can adjust its parameters against those true categories and become a trained text classification model. The labels of the text samples are usually assigned manually, either by a single annotator or by several. With single-annotator labeling, the assigned category is the label of the text sample and participates directly in training; because a single annotation is subjective, multi-annotator labeling can be adopted instead to avoid individual errors. With multi-annotator labeling, the label is chosen by vote: the preset category selected by the most annotators becomes the label of the text sample, and the model is trained on the samples and these labels. In either case, the final label names only one preset category, and may further be a one-hot label: for example, in the label vector (0, 0, 0, 1, 0), only one component has value 1 and the rest are 0, and the category of the text sample is the preset category represented by the component whose value is 1.
However, manual labeling is subjective: different annotators may assign different labels to the same text, and the same text may be judged to belong to different preset categories in different business scenarios. Since the accuracy of the text sample labels directly affects the accuracy of model training, the above approach leaves the text classification model insufficiently accurate.
The present disclosure provides a text classification method, apparatus, device, and computer-readable storage medium. The text classification model is trained according to labeling probability values corresponding to text samples and each preset category, where the labeling probability value of a text sample for a preset category is determined from the number of labeling results corresponding to that category and the total number of labeling results of the sample. Because the labeling probability value takes into account both the counts of the individual labeling results and their total number, the resulting labels are more objective, the model trained on them is more accurate, and text classification of the target text based on it is therefore more accurate.
The model structure of the text classification model may be a Transformer encoder, for example the BERT pre-trained model (Bidirectional Encoder Representations from Transformers). The working principle is illustrated with the text classification model shown in fig. 1, a schematic diagram of the text classification model provided by the present disclosure. As shown in fig. 1, the target text 101, "the teacher is in class", is input into the text classification model 102, which may consist of BERT and a classification layer. BERT may include feature-extraction layers for the word vectors, text vectors, and position vectors of the target text 101, and it outputs the word vector (embedding) corresponding to each word of the target text 101. A start symbol (CLS) is prepended to the target text 101, and the output vector corresponding to this start symbol serves as the semantic representation of the target text 101 for classification. That CLS output vector is the input of the classification layer, whose output is the prediction frequency values corresponding to the target text 101 and each preset category; the target category of the target text 101 is then obtained from these prediction frequency values.
Optionally, the word vectors of the target text 101 may be obtained by the following formula (1):
H = BERT(X)    formula (1)
where H denotes the word vectors corresponding to the words of the target text and X is the target text.
Optionally, the classification layer may be described by the following formula (2) and formula (3):
z = g(Wv + b)    formula (2)
q = softmax(z)    formula (3)
where v is the word vector corresponding to CLS, g is an activation function, q contains the prediction frequency values corresponding to the preset categories, and W and b are parameters.
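To make the architecture of fig. 1 and formulas (1) to (3) concrete, the following is a minimal sketch of such a model. It assumes the Hugging Face transformers library and the bert-base-chinese checkpoint, neither of which is specified by the disclosure, and takes g in formula (2) as the identity for brevity.

```python
# Sketch only: BERT encoder plus classification layer, per formulas (1)-(3).
# The transformers library and checkpoint name are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class TextClassifier(nn.Module):
    def __init__(self, num_categories: int, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)   # H = BERT(X), formula (1)
        self.linear = nn.Linear(self.bert.config.hidden_size, num_categories)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        v = out.last_hidden_state[:, 0]    # output vector of the CLS start symbol
        z = self.linear(v)                 # z = g(Wv + b), formula (2), with g = identity
        q = torch.softmax(z, dim=-1)       # q = softmax(z), formula (3)
        return q                           # prediction frequency values per preset category
```

In use, a tokenizer would prepend the CLS symbol to the target text (for example, "the teacher is in class") before encoding.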
The following describes the technical solutions of the present disclosure and how to solve the above problems with specific examples.
Fig. 2 is a schematic flowchart of a text classification method provided in an embodiment of the present disclosure, as shown in fig. 2, the method of this embodiment is executed by any device, equipment, platform, or equipment cluster having computing and processing capabilities, and the present disclosure is not limited thereto, and the method of this embodiment is as follows:
s201, inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories.
The text classification model is trained according to labeling probability values corresponding to the text sample and each preset category, and the labeling probability value of the text sample for each preset category is determined according to the number of labeling results corresponding to that preset category and the total number of labeling results of the text sample. The text classification model may have the model structure of the above embodiment, such as the structure shown in fig. 1.
In this embodiment, before text classification is performed on a target text by using a text classification model, the text classification model is trained in advance. In the process of training the text classification model, a text sample needs to be labeled first to obtain labeling probability values respectively corresponding to the text sample and all preset categories, namely labels corresponding to the text sample, and then the text classification model is trained according to the text sample and the labels corresponding to the text sample.
To facilitate training of the text classification model, the label may be represented by a label vector whose components represent, in order, the labeling probability values of the text sample for the preset categories, that is, the probabilities that the category of the text sample is each preset category. It can be understood that each labeling probability value is any value between 0 and 1 inclusive, and that the labeling probability values of a text sample over all preset categories sum to 1. For example, suppose the preset categories are ordered in the label vector as category 1, category 2, category 3, category 4. A label of (0.3, 0.7, 0, 0) then indicates that the labeling probability value of the text sample for category 1 is 0.3 (the probability that the text sample is category 1 is 0.3) and for category 2 is 0.7 (the probability that it is category 2 is 0.7).
The label may be derived from a plurality of labeling results. For example, if m annotators each label the category of the text sample, where m is an integer greater than 1, m labeling results {y_1, y_2, …, y_m} are obtained for the text sample, and each labeling result indicates that the category of the text sample is one of the preset categories. For each text sample, the number of labeling results corresponding to each preset category among the m results, together with the total number m, determines the label of the text sample, that is, the labeling probability values of the text sample for all the preset categories.
In a possible implementation manner, the ratio of the number of the labeling results corresponding to the preset category to the total number of the labeling results of the text sample is determined as a labeling probability value corresponding to the text sample and the preset category.
For example, suppose the preset categories are category 1, category 2, and category 3, and five annotators each label a given text sample, yielding 5 labeling results, of which 1 is category 1 and 4 are category 2. The labeling probability value of the text sample for category 1 is then 0.2, and for category 2 it is 0.8.
In another possible implementation, the ratio is obtained as above, and the product of the ratio and the weight corresponding to the ratio is determined as the labeling probability value corresponding to the text sample and the preset category.
After the text classification model is trained, the text classification model can be used for performing text classification on the target text, and the target text is input into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset classes.
S202, according to the prediction frequency values respectively corresponding to the target text and all the preset categories, determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text.
The preset category with the largest of the prediction frequency values corresponding to the target text and all the preset categories is the target category of the target text. For example, suppose the outputs of the text classification model represent, in order, the prediction frequency values for category 1, category 2, category 3, and category 4. If the target text is input into the model and the outputs are 0.2, 0.8, 0, and 0, then the prediction frequency value of the target text for category 1 is 0.2, for category 2 is 0.8, and for categories 3 and 4 is 0. The largest prediction frequency value is 0.8, whose preset category is category 2, so the category of the target text is category 2.
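Step S202 is a plain argmax over the prediction frequency values; a minimal self-contained sketch (the category names are illustrative):

```python
def select_target_category(pred_values, categories):
    """Return the preset category with the largest prediction frequency value (S202)."""
    best = max(range(len(pred_values)), key=lambda i: pred_values[i])
    return categories[best]

# Example from the text: outputs (0.2, 0.8, 0, 0) over categories 1-4 yield category 2.
assert select_target_category(
    [0.2, 0.8, 0.0, 0.0],
    ["category 1", "category 2", "category 3", "category 4"]) == "category 2"
```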
In this embodiment, the target text is input into the text classification model to obtain prediction frequency values corresponding to the target text and each preset category, and the preset category with the largest prediction frequency value is determined as the target category of the target text. The text classification model is trained according to labeling probability values corresponding to text samples and each preset category, where the labeling probability value of a text sample for a preset category is determined from the number of labeling results corresponding to that category and the total number of labeling results of the sample. Because the labeling probability value takes into account both the counts of the individual labeling results and their total number, the labels are more objective, the trained model is more accurate, and classification of the target text based on it is therefore more accurate.
Fig. 3 is a schematic flowchart of a training method for a text classification model according to an embodiment of the present disclosure, as shown in fig. 3, the method of this embodiment is executed by any device, equipment, platform, or equipment cluster having computing and processing capabilities, and the present disclosure is not limited thereto, and the method of this embodiment is as follows:
s301, obtaining a marking result of the text sample.
Each marking result corresponds to a preset category, namely each marking result indicates that the category of the text sample is one of the preset categories.
The training sample set for training the text classification model may include a plurality of text samples, each corresponding to a plurality of labeling results. For example, each text sample can be labeled with a preset category by m annotators, yielding m labeling results {y_1, y_2, …, y_m}, where m is an integer greater than 1.
S302, for each preset category, determining the labeling probability value of the text sample for the preset category according to the number of labeling results corresponding to the preset category and the total number of labeling results of the text sample.
For each text sample and each preset category, the labeling probability value of the text sample for the preset category is determined according to the number of labeling results corresponding to the preset category and the total number of labeling results of the text sample. For example, if m annotators label each text sample with a preset category, m labeling results are obtained; the number of results naming each preset category among the m results, together with the total number m, determines the labeling probability values of the text sample for all the preset categories, that is, the label of the text sample.
In a possible implementation manner, for each preset category, a ratio of the number of the labeling results corresponding to the preset category to the total number of the labeling results of the text sample is determined, and the ratio is a labeling probability value corresponding to the text sample and the preset category.
Assume there are k preset categories, each represented by an integer greater than 0. For each text sample there are m labeling results {y_1, y_2, …, y_m}, where y_j ∈ {1, 2, …, k} and j is any value from 1 to m; for example, if the first labeling result y_1 = 2, it indicates that the text sample belongs to the second preset category. The labeling probability value of the text sample for each preset category can then be obtained by the following formula (4):
r_i = (1/m) · Σ_{j=1}^{m} I(y_j = i),  i = 1, 2, …, k    formula (4)
where r_i is the labeling probability value corresponding to the i-th preset category, m is the total number of labeling results, y_j is the j-th labeling result, and I(·) is the indicator function, equal to 1 when y_j = i and 0 otherwise.
For example, suppose the preset categories are category 1, category 2, and category 3, and five annotators label one text sample, yielding the 5 labeling results 1, 2, 2, 2, 2. Since 1 result is category 1 and 4 results are category 2, the labeling probability value of the text sample for category 1 is 0.2 and for category 2 is 0.8.
For example, to facilitate training of the text classification model, the label and the labeling results may each be represented by a label vector whose components represent, in order, the labeling probability values of the text sample for the preset categories, that is, the probabilities that the category of the text sample is each preset category. Suppose the preset categories are ordered in the label vector as category 1, category 2, category 3; a labeling result stating that the text sample is category 1 can then be represented by the label vector (1, 0, 0).
In another possible implementation manner, for each preset category, a ratio of the number of the marking results corresponding to the preset category to the total number of the marking results of the text sample is obtained.
The method for obtaining the ratio is already described in the above implementation, and is not described herein.
The product of the ratio and the weight value corresponding to the ratio is then obtained, and the product is determined as the labeling probability value corresponding to the text sample and the preset category.
To increase the confidence of the maximum prediction frequency value, the ratio r can be further transformed to obtain the labeling probability value of the text sample for the preset category, that is, a softened label vector. For example, a weight may be set for each ratio, and the product of a ratio and its weight taken as the labeling probability value corresponding to the text sample and the preset category; for instance, the weight for the largest ratio may be set to a first weight greater than 1, and the weights for the other ratios to a second weight smaller than 1.
Optionally, the labeling probability value of the text sample for each preset category may be obtained by the following formula (5):
p_i = r_i^α / Σ_{j=1}^{k} r_j^α,  i = 1, 2, …, k    formula (5)
where p_i is the labeling probability value corresponding to the i-th preset category, r_i is the ratio obtained by formula (4) (its largest component corresponds to the mode of the m labeling results), and α is a preset parameter with α ≥ 1.
the confidence level of the preset category of the high ticket can be increased, namely, the maximum ratio is increased, and other ratios except the maximum ratio are reduced. For example, if a four-classification task of category 1, category 2, category 3, and category 4 is performed, two of the target texts in "xxxxxx" are labeled as category 2 among three labels, and one is labeled as category 4, the voting vector r = (0, 0.67, 0, 0.33) of the target text is obtained, and if α =3, the obtained label vector is p = (0, 0.89, 0, 0.11).
Optionally, if α = 1, then p = r: the obtained ratios are not processed and are used directly as the labeling probability values of the text sample for the preset categories. If α → ∞, exactly one component of p is 1 and the rest are 0, so the labeling probability values reduce to the one-hot vector obtained by taking the mode of the labeling results.
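The following sketch implements formulas (4) and (5) as reconstructed above and reproduces the worked example (the sharpening form of formula (5) is inferred from that example, so treat this as an illustration rather than the patent's reference implementation):

```python
def voting_vector(results, k):
    """Formula (4): r_i = (1/m) * sum_j I(y_j = i), categories numbered 1..k."""
    m = len(results)
    return [sum(1 for y in results if y == i) / m for i in range(1, k + 1)]

def label_vector(r, alpha=1.0):
    """Formula (5): p_i = r_i**alpha / sum_j r_j**alpha. alpha = 1 leaves r
    unchanged; larger alpha sharpens the highest-vote category."""
    powered = [x ** alpha for x in r]
    total = sum(powered)
    return [x / total for x in powered]

r = voting_vector([2, 2, 4], k=4)                        # labels: category 2, 2, 4
print([round(x, 2) for x in r])                          # [0.0, 0.67, 0.0, 0.33]
print([round(x, 2) for x in label_vector(r, alpha=3)])   # [0.0, 0.89, 0.0, 0.11]
```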
S303, training the text classification model according to the labeling probability values corresponding to the text sample and all the preset categories.
During training, the classification loss of each text sample can be calculated, and the parameters of the text classification model adjusted according to those classification losses; execution then returns to inputting text samples into the model until it converges, yielding the converged text classification model.
It is understood that the steps in the embodiment shown in fig. 3 may be performed separately for training the text classification model, or may be performed before S201, so that after the text classification model is trained, the target text is classified by using the text classification model.
In this embodiment, the labeling results of the text sample are obtained, each labeling result corresponding to one preset category; for each preset category, the labeling probability value of the text sample for that category is determined according to the number of labeling results corresponding to the category and the total number of labeling results of the sample; and the text classification model is trained according to the labeling probability values of the text sample for all the preset categories. Because the labeling probability value takes into account both the counts of the individual labeling results and their total number, the labels are more objective, the trained model is more accurate, and text classification of the target text based on it is therefore more accurate.
Fig. 4 is a schematic flowchart of another training method for a text classification model according to an embodiment of the present disclosure. Fig. 4 builds on the embodiment shown in fig. 3; further, as shown in fig. 4, S303 may be implemented by the following steps S3031, S3032, S3033, and S3034:
s3031, inputting the text sample into the text classification model to obtain prediction frequency values respectively corresponding to the text sample and all preset classes.
Training may proceed in multiple batches; for each batch, a plurality of text samples are input into the text classification model for training.
S3032, for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the labeling probability value corresponding to the preset category.
Alternatively, the classification loss of each text sample can be obtained by the following formula (6):
Loss_s = − Σ_{i=1}^{k} p_i · log(q_i)    formula (6)
where Loss_s is the classification loss of the s-th text sample, p_i is the labeling probability value of the s-th text sample for the i-th preset category, and q_i is the prediction frequency value output by the text classification model for the i-th preset category.
S3033, adjusting parameters of the text classification model according to the classification loss of the text sample.
From the classification losses of the text samples, the total classification loss of the text classification model may be determined, and the parameters of the text classification model are adjusted according to this total classification loss.
Alternatively, the total classification loss of the text classification model can be obtained by the following formula (7):
Loss = Σ_s Loss_s    formula (7)
where Loss is the total classification loss of the text classification model and Loss_s is the classification loss of the s-th text sample, summed over the text samples of the batch.
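A minimal sketch of the per-sample and total classification losses under the cross-entropy reading of formulas (6) and (7); the small epsilon is an added numerical-stability guard, not part of the disclosure:

```python
import math

def sample_classification_loss(p, q, eps=1e-12):
    """Formula (6): Loss_s = -sum_i p_i * log(q_i), where p holds the labeling
    probability values of one text sample and q the model's predictions."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def total_classification_loss(batch_p, batch_q):
    """Formula (7): total classification loss as the sum over the batch."""
    return sum(sample_classification_loss(p, q) for p, q in zip(batch_p, batch_q))
```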
S3034, judging whether the text classification model is converged.
Wherein the total classification loss of the text classification model is calculated for each batch.
Convergence of the text classification model may be determined in several ways: the model may be deemed converged when the classification loss is smaller than a first preset threshold, or when the change in classification loss across multiple training rounds is smaller than a second preset threshold; convergence may also be determined in other ways, and the present disclosure does not limit the condition used.
If the text classification model has converged, training stops. If not, execution returns to S3031 until the model converges.
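Either convergence criterion named above can be checked on the history of total classification losses; a sketch with assumed threshold names:

```python
def has_converged(loss_history, first_threshold=1e-3, second_threshold=1e-5):
    """True if the latest total loss is below a first preset threshold, or the
    change between the last two training rounds is below a second one."""
    if not loss_history:
        return False
    if loss_history[-1] < first_threshold:
        return True
    return (len(loss_history) >= 2
            and abs(loss_history[-1] - loss_history[-2]) < second_threshold)
```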
If the classification loss were calculated with only a single one-hot label, it would be determined solely by the preset category holding the highest probability value in that label. In this embodiment, the classification loss of each text sample is instead determined from the prediction frequency value and the labeling probability value of every preset category, so cases involving several plausible preset categories are taken into account; this improves the learning ability and generalization ability of the text classification model and makes it more accurate.
In some scenarios, certain words appear in texts of different preset categories: a word may occur frequently in texts of one preset category yet also appear at times in texts of other categories. For example, in text review, text belonging to the violation category typically contains certain ambiguous words that appear frequently in violation samples, but text containing these words is not always violating content. During training, if a text sample contains such multi-category words, the text classification model tends merely to check for their presence rather than grasp their meaning in the surrounding sentence, so it learns wrong information and its accuracy suffers; in text review in particular, the model often fails to learn whether a word carries a "violation" meaning. To address this, the influence of multi-category words in the text samples must be considered during training: a similarity loss is determined according to the similarity between the word vectors of the multi-category words contained in a text sample and the reference word vectors of those words, where the reference word vector of a multi-category word is its word vector in a reference text. The similarity loss incorporates the contextual semantics of the multi-category words, and adding its calculation to training makes the text classification model learn the distinctions among uses of such words, yielding higher accuracy in text classification.
Fig. 5 is a schematic flowchart of another training method for a text classification model provided in an embodiment of the present disclosure. Fig. 5 is based on the embodiment shown in Fig. 4; further, as shown in Fig. 5, after S3031 the method may further include the following steps S3035, S3036 and S3037, and accordingly S3033 may become S30331:
S3035, judging whether the text sample contains multi-category words.
For each text sample, it is judged whether the text sample contains multi-category words. A multi-category word set containing a plurality of multi-category words may be preset, and whether the text sample contains any multi-category word in the set is determined by comparing the text sample against the set.
If the text sample contains no multi-category word, the similarity loss is 0, that is, the similarity loss need not be calculated, and the text classification model is trained according to the existing process. If the text sample contains multi-category words, the method continues with S3036.
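A minimal sketch of this set-based check follows, with a hypothetical word set (the disclosure does not specify how the set is built):

```python
# Illustrative preset multi-category word set; in practice its entries would
# come from corpus statistics, not be hard-coded.
MULTI_CATEGORY_WORDS = {"word_a", "word_b"}

def find_multi_category_words(tokens, word_set=MULTI_CATEGORY_WORDS):
    # Compare the tokenized text sample against the preset set; an empty
    # result corresponds to the branch above (similarity loss is 0).
    return [tok for tok in tokens if tok in word_set]
```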
S3036, for each multi-category word, determining the similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word.
The reference word vector of the multi-category word is a word vector of the multi-category word in the reference text.
Optionally, the word vector of the multi-category word may be obtained through the text classification model itself. As shown in Fig. 1, the BERT in the text classification model 102 outputs a word vector for each word in the text; if the text contains multi-category words, their word vectors can be taken directly from the BERT output.
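As an illustrative sketch (the checkpoint name and the transformers library are assumptions; the disclosure only states that BERT outputs a word vector per word), the context-dependent vector of a multi-category word can be read off the encoder output:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-chinese" is an illustrative checkpoint choice, not named in the patent.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def contextual_word_vector(text: str, word: str) -> torch.Tensor:
    # Mean-pool the BERT output vectors of the tokens spanning `word` inside
    # `text`, giving its context-dependent word vector. Gradients are kept
    # so the similarity loss can later update the encoder.
    enc = tokenizer(text, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state[0]              # (seq_len, dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(word_ids) + 1):             # locate the word's tokens
        if ids[i:i + len(word_ids)] == word_ids:
            return hidden[i:i + len(word_ids)].mean(dim=0)
    raise ValueError(f"{word!r} not found in {text!r}")
```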
Further, the process of determining the similarity loss may include:
and judging whether the target class of the text sample is the same as the reference class of the reference text or not.
And if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word.
Optionally, if the target category of the text sample is the same as the reference category of the reference text, the similarity loss corresponding to the multi-category word may be obtained by the following formula (8):

$$l_{w} = \mathrm{sim}\left(v_{w}, \tilde{v}_{w}\right) \qquad \text{formula (8)}$$

wherein $w$ is the multi-category word, $l_{w}$ is the loss corresponding to $w$, $v_{w}$ is the word vector of $w$ in the text sample, $\tilde{v}_{w}$ is the reference word vector corresponding to $w$, and $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity between the two vectors.
If the target category of the text sample is different from the reference category of the reference text, the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word is acquired, and a difference value with the similarity is determined as the similarity loss corresponding to the multi-category word.
Optionally, if the target category of the text sample is different from the reference category of the reference text, the similarity loss corresponding to the multi-category word may be obtained by the following formula (9):

$$l_{w} = 1 - \mathrm{sim}\left(v_{w}, \tilde{v}_{w}\right) \qquad \text{formula (9)}$$

wherein $w$ is the multi-category word, $l_{w}$ is the loss corresponding to $w$, $v_{w}$ is the word vector of $w$ in the text sample, and $\tilde{v}_{w}$ is the reference word vector corresponding to $w$.
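By way of illustration, formulas (8) and (9) can be combined into one helper. This is a minimal sketch under two assumptions not fixed by the disclosure: Python/PyTorch as the implementation language and cosine similarity as the similarity measure (the original formula images do not specify it):

```python
import torch.nn.functional as F

def word_similarity_loss(word_vec, ref_vec, same_category):
    # Assumed measure: cosine similarity between the word vector in the
    # text sample and the reference word vector.
    sim = F.cosine_similarity(word_vec, ref_vec, dim=-1)
    # Formula (8): same target and reference category -> loss is the similarity.
    # Formula (9): different categories -> loss is the difference value 1 - similarity.
    return sim if same_category else 1.0 - sim
```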
S3037, determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-classification words.
After the similarity loss corresponding to each multi-category word is obtained, the similarity loss of the text sample can be determined from these per-word losses. Specifically, the similarity losses of all the multi-category words contained in the text sample may be added together. Illustratively, for the S-th text sample, the loss $l_{w}$ corresponding to each multi-category word in the sample is obtained through formula (8) or formula (9) above, and these losses are added to obtain the similarity loss $l_{\mathrm{sim}}^{(S)}$ of the S-th text sample, i.e. $l_{\mathrm{sim}}^{(S)} = \sum_{w} l_{w}$.
Optionally, the similarity loss corresponding to the text classification model is determined according to the similarity losses of the text samples. For example, it may be the sum of the similarity losses of all text samples, obtained by the following formula (10):

$$L_{\mathrm{sim}} = \sum_{S} l_{\mathrm{sim}}^{(S)} \qquad \text{formula (10)}$$

wherein $L_{\mathrm{sim}}$ is the similarity loss of the text classification model and $l_{\mathrm{sim}}^{(S)}$ is the similarity loss of the S-th text sample, obtained by adding the per-word losses $l_{w}$ given by formulas (8) and (9) above.
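The aggregation itself is a pair of sums, sketched here for completeness:

```python
def sample_similarity_loss(per_word_losses):
    # Sum of l_w over all multi-category words in one text sample;
    # 0 when the sample contains none.
    return sum(per_word_losses)

def total_similarity_loss(per_sample_losses):
    # Formula (10): sum of the per-sample similarity losses over all samples.
    return sum(per_sample_losses)
```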
S30331, adjusting parameters of the text classification model according to the classification loss of the text samples and the similarity loss of the text samples.
That is, the classification loss and the similarity loss of the text samples are used jointly to adjust the parameters of the text classification model.
Optionally, the total loss of the text classification model may be obtained from the classification loss of the text samples and the similarity loss of the text samples, and the parameters of the text classification model are then adjusted according to this total loss. The total loss can be obtained by the following formula (11):

$$L = L_{\mathrm{cls}} + L_{\mathrm{sim}} \qquad \text{formula (11)}$$

wherein $L$ is the total loss of the text classification model, $L_{\mathrm{cls}}$ is the overall classification loss, which can be obtained from formula (7), and $L_{\mathrm{sim}}$ is the overall similarity loss, which can be obtained from formula (10).
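As a sketch of how formula (11) drives one parameter update (assuming, for illustration, a model object that returns the two partial losses; the disclosure does not prescribe this interface):

```python
def training_step(model, optimizer, batch):
    # `model` is assumed to return the classification loss of formula (7)
    # and the similarity loss of formula (10) for the batch; this wiring is
    # illustrative, not mandated by the disclosure.
    cls_loss, sim_loss = model(batch)
    total = cls_loss + sim_loss        # formula (11)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```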
It can be understood that S3035, S3036 and S3037 have no fixed order relative to S3032: S3032 may be executed first and then S3035-S3037, S3035-S3037 may be executed first and then S3032, or they may be executed simultaneously.
In this embodiment, it is determined that a text sample contains multi-category words, and for each multi-category word the corresponding similarity loss is determined according to the similarity between the word vector of the multi-category word in the text sample and its reference word vector, the reference word vector being the word vector of the multi-category word in a reference text; the similarity loss of the text sample is then determined from the similarity losses corresponding to all the multi-category words. Because the similarity loss takes into account the contextual semantics of multi-category words, adding its calculation to the training process and adjusting the parameters of the text classification model on its basis lets the model learn the differences among usages of these words, so the model is more accurate when classifying text.
Fig. 6 is a schematic flowchart of another text classification method provided in an embodiment of the present disclosure. As shown in Fig. 6, the method of this embodiment is executed by any device, equipment, platform or equipment cluster having computing and processing capabilities, which the present disclosure does not limit. The method of this embodiment is as follows:
S601, inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories.
The text classification model is obtained by adjustment according to a similarity loss, the similarity loss being determined according to the similarity between the word vectors of the multi-category words contained in a text sample and the reference word vectors of those multi-category words; the reference word vector of a multi-category word is the word vector of the multi-category word in a reference text.
The text classification model is trained in advance, before it is used to classify the target text. During training, the calculation of the similarity loss is added, and the model is adjusted according to the similarity loss. For each text sample, whether the sample contains multi-category words is detected: a multi-category word set containing a plurality of multi-category words may be preset, and the text sample is compared against the set. If the text sample contains no multi-category word, the similarity loss is 0, that is, it need not be calculated, and the model is trained according to the existing process. If the text sample contains multi-category words, the similarity loss is determined according to the similarity between the word vector of each multi-category word and its reference word vector. The implementation principle of adjusting the model according to the similarity loss is similar to that of the embodiment shown in Fig. 5 and is not repeated here.
S602, according to the prediction frequency values respectively corresponding to the target text and all the preset categories, determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text.
The implementation principle of S602 is similar to that of S202, and is not described here again.
In this embodiment, the target text is input into the text classification model to obtain the prediction frequency values corresponding to the target text and all preset categories, and the preset category with the largest prediction frequency value is determined as the target category of the target text. The text classification model is trained according to a similarity loss obtained from the word vectors of the multi-category words contained in the text sample and their reference word vectors. Because the similarity loss takes into account the contextual semantics of multi-category words, adding its calculation to the training process lets the model learn the differences among usages of these words, so the model is more accurate when classifying text.
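For illustration, S601 and S602 together amount to a forward pass followed by an argmax; the `.logits` attribute assumes a sequence-classification head in the style of common BERT implementations, which the disclosure does not mandate:

```python
import torch

def classify(model, tokenizer, text, categories):
    # S601: obtain the predicted values for all preset categories.
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits            # (1, num_categories)
    # S602: the preset category with the largest value is the target category.
    return categories[int(logits.argmax(dim=-1))]
```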
Fig. 7 is a schematic flowchart of a training method for a text classification model according to another embodiment of the present disclosure. As shown in Fig. 7, the method of this embodiment is executed by any device, equipment, platform or equipment cluster with computing and processing capabilities, which the present disclosure does not limit. The method of this embodiment is as follows:
S701, inputting the text sample into a text classification model.
S702, judging whether the text sample contains multi-classification words.
And if the text sample does not contain the multi-classification words, training a text classification model according to the existing process.
If the text sample contains multiple classified words, the process continues to step S703.
The implementation principle and implementation manner of steps S702 and S3035 are similar, and are not described herein again.
S703, for each multi-category word, determining the similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word.
The reference word vector of the multi-category word is a word vector of the multi-category word in the reference text.
The implementation principle and implementation manner of steps S703 and S3036 are similar, and are not described herein again.
And S704, determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-classification words.
The implementation principle and implementation manner of steps S704 and S3037 are similar, and are not described herein again.
S705, adjusting parameters of the text classification model according to the similarity loss of the text samples.
And S706, judging whether the text classification model is converged.
The convergence of the text classification model can be determined when the similarity loss is smaller than a third preset threshold, or when the change between the similarity losses obtained in successive training rounds is smaller than a fourth preset threshold; the present disclosure does not limit the condition used to determine convergence of the text classification model.
If the text classification model has converged, training stops. Otherwise, the process returns to S701 until the text classification model converges.
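A compact sketch of the S701-S706 loop follows; `model.similarity_loss` is an assumed helper combining the steps above, and `has_converged` is the illustrative test sketched after S3034:

```python
def train_with_similarity_loss(model, optimizer, samples, max_epochs=50):
    history = []
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for sample in samples:
            # Assumed helper applying formulas (8)-(10) to one text sample
            # (see the sketches above); returns 0 for samples without
            # multi-category words.
            loss = model.similarity_loss(sample)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        history.append(epoch_loss)
        if has_converged(history):     # convergence test as in S706
            break
```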
It is to be understood that the steps in the embodiment shown in Fig. 7 may be performed separately to train the text classification model, or may be performed before S601, so that after the text classification model is trained, the target text is classified using the text classification model.
In this embodiment, it is determined that a text sample contains multi-category words, and for each multi-category word the corresponding similarity loss is determined according to the similarity between the word vector of the multi-category word in the text sample and its reference word vector, the reference word vector being the word vector of the multi-category word in a reference text. The similarity loss of the text sample is determined from the similarity losses corresponding to all the multi-category words, and the parameters of the text classification model are adjusted according to the similarity loss of the text sample. Because the similarity loss takes into account the contextual semantics of multi-category words, adding its calculation to the training process lets the model learn the differences among usages of these words, so the model is more accurate when classifying text.
Fig. 8 is a schematic flowchart of another training method for a text classification model provided in an embodiment of the present disclosure. Fig. 8 is based on the embodiment shown in Fig. 7; further, as shown in Fig. 8, S701 may include S7011, and S703 may include S7031, S7032, S7033 and S7034:
S7011, inputting the text sample into the text classification model to obtain the target category of the text sample.
S7031, judging whether the target category of the text sample is the same as the reference category of the reference text.
If the target category of the text sample is the same as the reference category of the reference text, execution continues with S7032. If the target category of the text sample is different from the reference category of the reference text, execution continues with S7033.
S7032, determining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, and determining the similarity loss corresponding to the multi-category word.
S7033, obtaining the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word.
S7034, determining the difference value with the similarity as the similarity loss corresponding to the multi-category word.
The method of this embodiment is similar in implementation principle and manner to S3036 and is not described herein again.
Fig. 9 is a schematic structural diagram of a text classification apparatus provided in an embodiment of the present disclosure, and as shown in fig. 9, the apparatus of the embodiment includes:
an obtaining module 91, configured to input the target text into the text classification model, and obtain prediction frequency values corresponding to the target text and all preset categories, respectively; the text classification model is obtained by training according to labeling probability values respectively corresponding to the text samples and all preset categories, and the labeling probability value corresponding to each preset category of the text samples is determined according to the number of labeling results corresponding to the preset categories and the total number of the labeling results of the text samples;
the determining module 92 is configured to determine, according to the prediction frequency values respectively corresponding to the target text and all the preset categories, the preset category with the largest prediction frequency value as the target category corresponding to the target text.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring marking results of the text samples, wherein each marking result corresponds to a preset category;
the determination module 92 is further configured to: for each preset category, determining a labeling probability value corresponding to the text sample and the preset category according to the number of labeling results corresponding to the preset category and the total number of the labeling results of the text sample;
and the training module is used for training the text classification model according to the labeling probability values respectively corresponding to the text samples and all the preset categories.
Optionally, the determining module 92 is specifically configured to:
and determining the ratio of the number of the marking results corresponding to the preset category to the total number of the marking results of the text sample, wherein the ratio is the marking probability value corresponding to the text sample and the preset category.
Optionally, the determining module 92 is specifically configured to:
acquiring the ratio of the number of the labeling results corresponding to the preset category to the total number of the labeling results of the text sample;
and obtaining the product of the ratio and its corresponding weight value, and determining the product as the labeling probability value corresponding to the text sample and the preset category.
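For illustration, the ratio and the weighted-product variants can be sketched as follows (function and parameter names are hypothetical):

```python
from collections import Counter

def label_probabilities(annotations, categories, weights=None):
    # annotations: the m labeling results of one text sample, e.g.
    #              ["cat_a", "cat_a", "cat_b"] for m = 3 annotators.
    counts = Counter(annotations)
    total = len(annotations)
    # Base form: ratio of the per-category count to the total number of results.
    probs = {c: counts.get(c, 0) / total for c in categories}
    if weights is not None:
        # Weighted form: product of the ratio and its corresponding weight value.
        probs = {c: weights[c] * probs[c] for c in categories}
    return probs
```

For m = 3 labeling results `["cat_a", "cat_a", "cat_b"]`, the unweighted call returns `{"cat_a": 2/3, "cat_b": 1/3}`.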
Optionally, the training module is specifically configured to:
inputting the text sample into a text classification model to obtain prediction frequency values respectively corresponding to the text sample and all preset classes;
for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the label probability value corresponding to the preset category;
and adjusting parameters of the text classification model according to the classification loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model is converged.
Optionally, the determining module 92 is further configured to:
determining that the text sample contains multi-category words;
for each multi-classification word, determining similarity loss corresponding to the multi-classification word according to similarity of a word vector of the multi-classification word in a text sample and a reference word vector of the multi-classification word, wherein the reference word vector of the multi-classification word is the word vector of the multi-classification word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-classification words;
correspondingly, the training module is specifically configured to:
and adjusting parameters of the text classification model according to the classification loss of the text samples and the similarity loss of the text samples.
Optionally, the training module is specifically configured to:
inputting the text sample into a text classification model to obtain a target category of the text sample;
the determining module 92 is specifically configured to:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity of the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
and if the target category of the text sample is different from the reference category of the reference text, acquiring the similarity of the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference value with the similarity as the similarity loss corresponding to the multi-classification word.
Fig. 10 is a schematic structural diagram of an apparatus for classifying texts according to another embodiment of the present disclosure, as shown in fig. 10, the apparatus provided in this embodiment includes:
the obtaining module 11 is configured to input the target text into the text classification model, and obtain prediction frequency values corresponding to the target text and all preset categories respectively; the text classification model is obtained by training according to similarity loss, and the similarity loss is determined according to the similarity of word vectors of multi-classification words and reference word vectors of the multi-classification words contained in a text sample; the reference word vector of the multi-classification word is a word vector of the multi-classification word in the reference text;
the determining module 12 is configured to determine, according to the prediction frequency values respectively corresponding to the target text and all the preset categories, the preset category with the largest prediction frequency value as the target category corresponding to the target text.
Optionally, the apparatus further comprises:
the input module is used for inputting the text sample into the text classification model;
the determination module 12 is further configured to: determining that the text sample contains multi-category words; for each multi-classification word, determining similarity loss corresponding to the multi-classification word according to similarity of a word vector of the multi-classification word in a text sample and a reference word vector of the multi-classification word, wherein the reference word vector of the multi-classification word is the word vector of the multi-classification word in a reference text; determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-classification words;
and the adjusting module is used for adjusting the parameters of the text classification model according to the similarity loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model is converged.
Optionally, the apparatus further comprises:
the obtaining module is used for inputting the text sample into the text classification model to obtain the target category of the text sample;
the determining module 12 is specifically configured to:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity of the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
and if the target category of the text sample is different from the reference category of the reference text, acquiring the similarity of the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference value with the similarity as the similarity loss corresponding to the multi-classification word.
The apparatus of the foregoing embodiment may be configured to implement the technical solution of the foregoing method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 11 is a schematic structural diagram of a text classification device provided in an embodiment of the present disclosure, and as shown in fig. 11, the text classification device in this embodiment includes:
a memory 111 for storing processor-executable instructions;
a processor 112 for implementing the method described in any one of Figs. 2-5 above when the executable instructions are executed.
Fig. 12 is a schematic structural diagram of another text classification device provided in the embodiment of the present disclosure, and as shown in fig. 12, the device of the embodiment includes:
a memory 121 for storing processor-executable instructions;
a processor 122 for implementing the method described in any one of Figs. 6-8 above when the executable instructions are executed.
The apparatus of the foregoing embodiment may be configured to implement the technical solution of the foregoing method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again.
The disclosed embodiments provide a computer-readable storage medium having stored therein computer-executable instructions for implementing a method of text classification as described in any of fig. 2-5 above when executed by a processor.
The disclosed embodiments provide a computer-readable storage medium having stored therein computer-executable instructions for implementing a method for text classification as described in any one of fig. 6-8 above when executed by a processor.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method of text classification, comprising:
inputting a target text into a text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories; the text classification model is obtained by training according to labeling probability values respectively corresponding to a text sample and all preset categories and total loss of the text sample, wherein the total loss of the text sample comprises classification loss of the text sample and similarity loss of the text sample, the labeling probability value corresponding to each preset category of the text sample is determined according to the number of labeling results corresponding to the preset category and the total number of labeling results of the text sample, for each text sample, the text sample is labeled by m persons in the preset category, and m labeling results are obtained, wherein m is an integer greater than 1; in the case that the text sample contains multi-category words, the similarity loss of the text sample is obtained by the following method: determining similarity loss corresponding to the multi-classification words according to similarity of word vectors of the multi-classification words in the text sample and reference word vectors of the multi-classification words, wherein the reference word vectors of the multi-classification words are word vectors of the multi-classification words in a reference text;
and determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
2. The method of claim 1, wherein before inputting the target text into the text classification model and obtaining the predicted frequency values corresponding to the target text and all the preset categories, the method further comprises:
obtaining labeling results of the text samples, wherein each labeling result corresponds to a preset category;
for each preset category, determining a labeling probability value of the text sample corresponding to the preset category according to the number of labeling results corresponding to the preset category and the total number of the labeling results of the text sample;
and training the text classification model according to the labeling probability values respectively corresponding to the text samples and all the preset classes.
3. The method of claim 2, wherein the determining the labeling probability value of the text sample corresponding to the preset category according to the number of the labeling results corresponding to the preset category and the total number of the labeling results of the text sample comprises:
and determining the ratio of the number of the marking results corresponding to the preset category to the total number of the marking results of the text sample as the marking probability value corresponding to the text sample and the preset category.
4. The method of claim 2, wherein the determining the labeling probability value of the text sample corresponding to the preset category according to the number of the labeling results corresponding to the preset category and the total number of the labeling results of the text sample comprises:
acquiring the ratio of the number of the labeling results corresponding to the preset category to the total number of the labeling results of the text sample;
and obtaining the product of the weight value corresponding to the ratio and the ratio, and determining the product as the labeling probability value corresponding to the text sample and the preset category.
5. The method according to any one of claims 2 to 4, wherein the training the text classification model according to the label probability values of the text samples corresponding to all the preset categories respectively comprises:
inputting the text sample into the text classification model to obtain prediction frequency values respectively corresponding to the text sample and all the preset categories;
for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the label probability value corresponding to the preset category;
and adjusting parameters of the text classification model according to the classification loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model converges.
6. The method of claim 5, wherein after inputting the text sample into the text classification model and obtaining the predicted frequency values corresponding to the text sample and all the predetermined categories, the method further comprises:
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
correspondingly, the adjusting the parameters of the text classification model according to the classification loss of the text sample includes:
and adjusting parameters of the text classification model according to the classification loss of the text sample and the similarity loss of the text sample.
7. The method of claim 6, wherein the entering the text sample into the text classification model comprises:
inputting the text sample into the text classification model to obtain a target category of the text sample;
determining a similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, including:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference value with the similarity as the similarity loss corresponding to the multi-classification word.
8. A method of text classification, comprising:
inputting a target text into a text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories; the text classification model is obtained by training according to similarity loss, and the similarity loss is determined according to the similarity of word vectors of multi-classification words contained in a text sample and reference word vectors of the multi-classification words; the reference word vector of the multi-classification word is a word vector of the multi-classification word in a reference text;
determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories;
before the step of inputting the target text into the text classification model and obtaining the prediction frequency values corresponding to the target text and all the preset categories respectively, the method further includes:
inputting a text sample into the text classification model;
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
adjusting parameters of the text classification model according to the similarity loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model is converged;
before determining that the text sample contains the multi-category word, the method further comprises:
inputting a text sample into the text classification model to obtain a target category of the text sample;
determining a similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, including:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference value with the similarity as the similarity loss corresponding to the multi-classification word.
9. An apparatus for text classification, comprising:
the obtaining module is used for inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset classes; the text classification model is obtained by training according to labeling probability values respectively corresponding to a text sample and all preset categories and total loss of the text sample, wherein the total loss of the text sample comprises classification loss of the text sample and similarity loss of the text sample, the labeling probability value corresponding to each preset category of the text sample is determined according to the number of labeling results corresponding to the preset category and the total number of labeling results of the text sample, for each text sample, the text sample is labeled by m persons in the preset category, and m labeling results are obtained, wherein m is an integer greater than 1; in the case that the text sample contains multi-category words, the similarity loss of the text sample is obtained by the following method: determining similarity loss corresponding to the multi-classification words according to similarity of word vectors of the multi-classification words in the text sample and reference word vectors of the multi-classification words, wherein the reference word vectors of the multi-classification words are word vectors of the multi-classification words in a reference text;
and the determining module is used for determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
10. An apparatus for text classification, comprising:
the obtaining module is used for inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset classes; the text classification model is obtained by training according to similarity loss, and the similarity loss is determined according to the similarity of word vectors of multi-classification words contained in a text sample and reference word vectors of the multi-classification words; the reference word vector of the multi-classification word is a word vector of the multi-classification word in a reference text;
the determining module is used for determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories;
the input module is used for inputting the text sample into the text classification model;
the determining module is further used for determining that the text sample contains multi-category words; for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text; determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
the adjusting module is used for adjusting parameters of the text classification model according to the similarity loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model is converged;
the obtaining module is used for inputting the text sample into the text classification model to obtain the target category of the text sample;
the determining module is specifically configured to determine, if the target category of the text sample is the same as the reference category of the reference text, a similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, and determine a similarity loss corresponding to the multi-category word;
if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference value with the similarity as the similarity loss corresponding to the multi-classification word.
11. An apparatus for text classification, comprising:
a memory for storing processor-executable instructions;
a processor for implementing the method of any one of claims 1 to 7 when the executable instructions are executed.
12. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, implement a method of text classification as claimed in any one of claims 1 to 7.
CN202110392536.2A 2021-04-13 2021-04-13 Text classification method, device, equipment and computer readable storage medium Active CN112989051B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110392536.2A CN112989051B (en) 2021-04-13 2021-04-13 Text classification method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110392536.2A CN112989051B (en) 2021-04-13 2021-04-13 Text classification method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112989051A CN112989051A (en) 2021-06-18
CN112989051B true CN112989051B (en) 2021-09-10

Family

ID=76338108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110392536.2A Active CN112989051B (en) 2021-04-13 2021-04-13 Text classification method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112989051B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008342A (en) * 2019-04-12 2019-07-12 智慧芽信息科技(苏州)有限公司 Document classification method, apparatus, equipment and storage medium
CN110046356A (en) * 2019-04-26 2019-07-23 中森云链(成都)科技有限责任公司 Label is embedded in the application study in the classification of microblogging text mood multi-tag
CN110705274A (en) * 2019-09-06 2020-01-17 电子科技大学 Fusion type word meaning embedding method based on real-time learning
CN111563167A (en) * 2020-07-15 2020-08-21 智者四海(北京)技术有限公司 Text classification system and method
CN112270379A (en) * 2020-11-13 2021-01-26 北京百度网讯科技有限公司 Training method of classification model, sample classification method, device and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11176330B2 (en) * 2019-07-22 2021-11-16 Advanced New Technologies Co., Ltd. Generating recommendation information


Also Published As

Publication number Publication date
CN112989051A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN110222178B (en) Text emotion classification method and device, electronic equipment and readable storage medium
CN109710744B (en) Data matching method, device, equipment and storage medium
CN107025284A (en) The recognition methods of network comment text emotion tendency and convolutional neural networks model
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN111522908A (en) Multi-label text classification method based on BiGRU and attention mechanism
CN112667782A (en) Text classification method, device, equipment and storage medium
CN113704416A (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
EP3929800A1 (en) Skill word evaluation method and device, electronic device, and computer readable medium
CN113157859A (en) Event detection method based on upper concept information
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN114925702A (en) Text similarity recognition method and device, electronic equipment and storage medium
CN111475648B (en) Text classification model generation method, text classification device and equipment
GB2572320A (en) Hate speech detection system for online media content
CN112989051B (en) Text classification method, device, equipment and computer readable storage medium
CN115309899B (en) Method and system for identifying and storing specific content in text
Spichakova et al. Application of Machine Learning for Assessment of HS Code Correctness.
CN115687917A (en) Sample processing method and device, and recognition model training method and device
CN115713082A (en) Named entity identification method, device, equipment and storage medium
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
Al Mahmud et al. A New Approach to Analysis of Public Sentiment on Padma Bridge in Bangla Text
EP2565799A1 (en) Method and device for generating a fuzzy rule base for classifying logical structure features of printed documents
CN116304058B (en) Method and device for identifying negative information of enterprise, electronic equipment and storage medium
CN113220824B (en) Data retrieval method, device, equipment and storage medium
CN116738345B (en) Classification processing method, related device and medium
CN117521673B (en) Natural language processing system with analysis training performance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant