CN112989051A - Text classification method, device, equipment and computer readable storage medium


Info

Publication number
CN112989051A
Authority
CN
China
Prior art keywords: text, category, word, classification, preset
Prior art date
Legal status
Granted
Application number
CN202110392536.2A
Other languages
Chinese (zh)
Other versions
CN112989051B (en)
Inventor
郭良越
丁文彪
刘琼琼
刘子韬
Current Assignee
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110392536.2A
Publication of CN112989051A
Application granted
Publication of CN112989051B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0241 Advertisements
    • G06Q 30/0242 Determining effectiveness of advertisements


Abstract

The present disclosure provides a text classification method, apparatus, device, and computer-readable storage medium. The method comprises the following steps: inputting a target text into a text classification model to obtain prediction frequency values corresponding to the target text and each preset category, and determining the preset category with the largest prediction frequency value as the target category of the target text. The text classification model used for text classification is trained according to labeling probability values corresponding to a text sample and each preset category, where the labeling probability value of the text sample for a preset category is determined from the number of labeling results corresponding to that category and the total number of labeling results of the text sample. Because the labeling probability values take all labeling results into account, the labels are more objective, and a text classification model trained on them classifies the target text more accurately.

Description

Text classification method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text classification method, apparatus, device, and computer-readable storage medium.
Background
Many scenarios involve text classification, i.e., classifying and labeling text according to a certain classification system or standard. For example, in a text review scenario, text content must be reviewed for impropriety so that sensitive, vulgar, and advertising content can be filtered out. The text review process is in fact a text classification process; for example, the text categories can be defined as a normal category, an abuse category, and an advertisement category.
At present, text classification is mostly performed with a text classification model. The labels of the model's training samples are usually annotated by a single person, and the text classification model is trained according to the training samples and their labels.
However, this way of labeling limits the accuracy of text classification with the resulting text classification model.
Disclosure of Invention
To solve the above technical problem or to at least partially solve the above technical problem, the present disclosure provides a method, an apparatus, a device, and a computer-readable storage medium for text classification.
In a first aspect, the present disclosure provides a method for text classification, including:
inputting a target text into a text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories; the text classification model is obtained by training according to labeling probability values respectively corresponding to a text sample and all preset categories, and the labeling probability value corresponding to each preset category of the text sample is determined according to the number of labeling results corresponding to the preset categories and the total number of the labeling results of the text sample;
and determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
Optionally, before the step of inputting the target text into the text classification model and obtaining the prediction frequency values corresponding to the target text and all the preset categories, the method further includes:
obtaining marking results of the text samples, wherein each marking result corresponds to a preset category;
for each preset category, determining a labeling probability value of the text sample corresponding to the preset category according to the number of labeling results corresponding to the preset category and the total number of the labeling results of the text sample;
and training the text classification model according to the labeling probability values respectively corresponding to the text samples and all the preset classes.
Optionally, the determining, according to the number of the labeling results corresponding to the preset category and the total number of the labeling results of the text sample, a labeling probability value of the text sample corresponding to the preset category includes:
and determining the ratio of the number of the marking results corresponding to the preset category to the total number of the marking results of the text sample as the marking probability value corresponding to the text sample and the preset category.
Optionally, the determining, according to the number of the labeling results corresponding to the preset category and the total number of the labeling results of the text sample, a labeling probability value of the text sample corresponding to the preset category includes:
acquiring the number of the marking results corresponding to the preset category and the ratio of the total number of the marking results of the text sample;
and obtaining the product of the weight value corresponding to the ratio and the ratio, and determining the product as the labeling probability value corresponding to the text sample and the preset category.
Optionally, the training of the text classification model according to the label probability values respectively corresponding to the text sample and all the preset categories includes:
inputting the text sample into the text classification model to obtain prediction frequency values respectively corresponding to the text sample and all the preset categories;
for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the label probability value corresponding to the preset category;
and adjusting parameters of the text classification model according to the classification loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model converges.
Optionally, after the text sample is input into the text classification model and prediction frequency values respectively corresponding to the text sample and all the preset categories are obtained, the method further includes:
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
correspondingly, the adjusting the parameters of the text classification model according to the classification loss of the text sample includes:
and adjusting parameters of the text classification model according to the classification loss of the text sample and the similarity loss of the text sample.
Optionally, the inputting the text sample into the text classification model includes:
inputting a text sample into the text classification model to obtain a target category of the text sample;
determining a similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, including:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
and if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-classification word.
In a second aspect, the present disclosure provides a method of text classification, comprising:
inputting a target text into a text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories; the text classification model is obtained by training according to similarity loss, and the similarity loss is determined according to the similarity of word vectors of multi-classification words contained in a text sample and reference word vectors of the multi-classification words; the reference word vector of the multi-classification word is a word vector of the multi-classification word in a reference text;
and determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
Optionally, before the step of inputting the target text into the text classification model and obtaining the prediction frequency values corresponding to the target text and all the preset categories, the method further includes:
inputting a text sample into the text classification model;
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
and adjusting parameters of the text classification model according to the similarity loss of the text samples, and returning to execute the step of inputting the text samples into the text classification model until the text classification model converges.
Optionally, before determining that the text sample contains the multi-category word, the method further includes:
inputting a text sample into the text classification model to obtain a target category of the text sample;
determining a similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, including:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
and if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-classification word.
In a third aspect, the present disclosure provides an apparatus for text classification, including:
the obtaining module is used for inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset classes; the text classification model is obtained by training according to labeling probability values respectively corresponding to a text sample and all preset categories, and the labeling probability value corresponding to each preset category of the text sample is determined according to the number of labeling results corresponding to the preset categories and the total number of the labeling results of the text sample;
and the determining module is used for determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring marking results of the text samples, wherein each marking result corresponds to a preset category;
the determination module is further configured to: for each preset category, determining a labeling probability value of the text sample corresponding to the preset category according to the number of labeling results corresponding to the preset category and the total number of the labeling results of the text sample;
and the training module is used for training the text classification model according to the labeling probability values respectively corresponding to the text samples and all the preset categories.
Optionally, the determining module is specifically configured to:
and determining the ratio of the number of the marking results corresponding to the preset category to the total number of the marking results of the text sample as the marking probability value corresponding to the text sample and the preset category.
Optionally, the determining module is specifically configured to:
acquiring the number of the marking results corresponding to the preset category and the ratio of the total number of the marking results of the text sample;
and obtaining the product of the weight value corresponding to the ratio and the ratio, and determining the product as the labeling probability value corresponding to the text sample and the preset category.
Optionally, the training module is specifically configured to:
inputting the text sample into the text classification model to obtain prediction frequency values respectively corresponding to the text sample and all the preset categories;
for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the label probability value corresponding to the preset category;
and adjusting parameters of the text classification model according to the classification loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model converges.
Optionally, the determining module is further configured to:
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
correspondingly, the training module is specifically configured to:
and adjusting parameters of the text classification model according to the classification loss of the text sample and the similarity loss of the text sample.
Optionally, the training module is specifically configured to:
inputting a text sample into the text classification model to obtain a target category of the text sample;
the determining module is specifically configured to:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
and if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-classification word.
In a fourth aspect, the present disclosure provides an apparatus for text classification, comprising:
the obtaining module is used for inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset classes; the text classification model is obtained by training according to similarity loss, and the similarity loss is determined according to the similarity of word vectors of multi-classification words contained in a text sample and reference word vectors of the multi-classification words; the reference word vector of the multi-classification word is a word vector of the multi-classification word in a reference text;
and the determining module is used for determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
Optionally, the apparatus further comprises:
the input module is used for inputting the text sample into the text classification model;
the determination module is further to: determining that the text sample contains multi-category words; for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text; determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
and the adjusting module is used for adjusting the parameters of the text classification model according to the similarity loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model is converged.
Optionally, the apparatus further comprises:
the obtaining module is used for inputting the text sample into the text classification model to obtain the target category of the text sample;
the determining module is specifically configured to:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
and if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-classification word.
In a fifth aspect, the present disclosure provides an apparatus for text classification, comprising:
a memory for storing processor-executable instructions;
a processor configured to implement the method according to the first aspect when executing the executable instructions.
In a sixth aspect, the present disclosure provides an apparatus for text classification, comprising:
a memory for storing processor-executable instructions;
a processor configured to implement the method according to the second aspect when executing the executable instructions.
In a seventh aspect, the present disclosure provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the method for text classification as described in the first aspect above when executed by a processor.
In an eighth aspect, the present disclosure provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the method of text classification as described in the second aspect above when executed by a processor.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
and inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories. And determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories. The text classification model for text classification is obtained by training according to labeling probability values respectively corresponding to the text samples and all preset classes, and the labeling probability value corresponding to each preset class of the text samples is determined according to the number of the labeling results corresponding to the preset classes and the total number of the labeling results of the text samples. The labeling probability value considers the number of the plurality of labeling results and the total number of the labeling results, so that the labeling label is more objective, the accuracy of a text classification model trained based on the labeled label is higher, and the accuracy of text classification of the target text is higher based on the text classification model.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure or in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below; it is obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a text classification model provided by the present disclosure;
Fig. 2 is a schematic flowchart of a text classification method according to an embodiment of the present disclosure;
Fig. 3 is a schematic flowchart of a training method for a text classification model according to an embodiment of the present disclosure;
Fig. 4 is a schematic flowchart of another training method for a text classification model according to an embodiment of the present disclosure;
Fig. 5 is a schematic flowchart of a training method for a text classification model according to another embodiment of the present disclosure;
Fig. 6 is a schematic flowchart of another text classification method according to an embodiment of the present disclosure;
Fig. 7 is a schematic flowchart of a training method for a text classification model according to another embodiment of the present disclosure;
Fig. 8 is a schematic flowchart of a training method for a text classification model according to another embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of a text classification apparatus according to an embodiment of the present disclosure;
Fig. 10 is a schematic structural diagram of another text classification apparatus according to an embodiment of the present disclosure;
Fig. 11 is a schematic structural diagram of a text classification device according to an embodiment of the present disclosure;
Fig. 12 is a schematic structural diagram of another text classification device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
In many scenarios, text classification is involved, i.e., text is classified and labeled according to a certain classification system or standard.
The process of text review is in fact a process of text classification. A text review scenario is taken below as an example of a specific application scenario of the present disclosure. Text review examines whether the content of a target text is inappropriate, so that sensitive, vulgar, advertising and similar content can be filtered out. A plurality of categories, i.e., preset categories, may be defined for the target text. For example, the preset categories may include, but are not limited to, a normal category and a violation category, where the violation category may include, but is not limited to, one or more of the following: an advertisement category, an abuse category, a pornography category, etc. For any target text, the target category of the target text can be determined by reviewing its content. The traditional manual text review approach is inefficient and can hardly keep up with growing business demand.
At present, machine-learning text classification models are mostly used for text classification. Training such a model is supervised learning: the model must be trained with text samples and their labels. The labels used during training mark the true categories of the text samples, so that the text classification model adjusts its parameters according to the true categories, yielding a trained model. The labels of the text samples are usually annotated manually, either by a single person or by multiple people. With single-person labeling, the annotated category is the label of the text sample and participates in training the text classification model. Because single-person labeling is subjective, multi-person labeling can be adopted to avoid the errors of single-person labeling; in that case the label is selected from the labeling results of the multiple annotators, i.e., the preset category chosen by the most annotators is used as the label of the text sample, and the text classification model is trained according to the text samples and these labels. With either single-person or multi-person labeling, the final label marks only one preset category, and it may further be a one-hot label. For example, if the label vector is (0, 0, 0, 1, 0), only one component of the label vector has the value 1 and the rest are 0, and the category of the text sample is the preset category represented by the component whose value is 1.
However, manual labeling is subjective: different people may label the same text differently, and the same text may be judged to belong to different preset categories in different business scenarios. The accuracy of the labels of the text samples directly affects the training of the text classification model, so this approach limits the model's accuracy.
The present disclosure provides a text classification method, apparatus, device, and computer-readable storage medium. The text classification model is trained according to labeling probability values corresponding to the text sample and each preset category, where the labeling probability value of the text sample for each preset category is determined from the number of labeling results corresponding to that category and the total number of labeling results of the text sample. Because the labeling probability values take both the per-category counts and the total number of labeling results into account, the labels are more objective, a text classification model trained on such labels is more accurate, and text classification of the target text based on this model is therefore more accurate.
The model structure of the text classification model may be a Transformer encoder, for example the pre-trained model BERT (Bidirectional Encoder Representations from Transformers). The working principle of the text classification model is illustrated below with the model shown in fig. 1. Fig. 1 is a schematic diagram of the text classification model provided by the present disclosure. As shown in fig. 1, the target text 101 is "the teacher is in class" and is input into the text classification model 102, which may consist of BERT and a classification layer. BERT may include feature extraction layers for the target text 101, such as word-vector, text-vector, and position-vector layers, and its output is the word vector (embedding) corresponding to each word of the target text 101. A beginning symbol (CLS) is prepended to the target text 101, and the output vector corresponding to this symbol serves as the semantic representation of the target text 101 and is used to classify it: the output vector corresponding to CLS is taken as the input of the classification layer, whose output is the prediction frequency values corresponding to the target text 101 and each preset category, from which the target category of the target text 101 is obtained.
Optionally, the word vectors of the target text 101 may be obtained by the following formula (1):

H = BERT(X)    formula (1)

where H denotes the word vectors corresponding to the words of the target text and X is the target text.
Optionally, the classification layer may be described by the following formula (2) and formula (3):

z = g(Wv + b)    formula (2)

q = softmax(z)    formula (3)

where v is the output vector corresponding to CLS, g is an activation function, q is the vector of prediction frequency values corresponding to the preset categories, and W and b are parameters.
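To make the structure of fig. 1 concrete, the following is a minimal sketch of the BERT-plus-classification-layer model described above, assuming PyTorch and the Hugging Face transformers library; the checkpoint name, the number of categories, and the example sentence are illustrative assumptions rather than details fixed by the present disclosure.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextClassifier(nn.Module):
    """BERT encoder followed by a classification layer, as in fig. 1."""
    def __init__(self, num_categories: int, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)  # formula (1): H = BERT(X)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_categories)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        v = out.last_hidden_state[:, 0]   # output vector at the [CLS] position
        z = self.fc(v)                    # formula (2), with g folded into the layer
        return torch.softmax(z, dim=-1)   # formula (3): prediction frequency values q

# Usage corresponding to S201-S202 below: classify one target text.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = TextClassifier(num_categories=4)
enc = tokenizer("The teacher is in class", return_tensors="pt")
q = model(enc["input_ids"], enc["attention_mask"])
target_category = q.argmax(dim=-1).item()  # preset category with the largest value
```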
The following describes the technical solutions of the present disclosure and how to solve the above problems with specific examples.
Fig. 2 is a schematic flowchart of a text classification method provided in an embodiment of the present disclosure. As shown in fig. 2, the method of this embodiment is executed by any device, platform, or device cluster having computing and processing capabilities, which the present disclosure does not limit. The method of this embodiment is as follows:
S201, inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories.
The text classification model is obtained by training according to labeling probability values respectively corresponding to the text samples and all preset categories, and the labeling probability value corresponding to each preset category of the text samples is determined according to the number of labeling results corresponding to the preset categories and the total number of the labeling results of the text samples. The text classification model may be a model structure of the above embodiment, such as the model structure shown in fig. 1.
In this embodiment, before text classification is performed on a target text by using a text classification model, the text classification model is trained in advance. In the process of training the text classification model, a text sample needs to be labeled first to obtain labeling probability values respectively corresponding to the text sample and all preset categories, namely labels corresponding to the text sample, and then the text classification model is trained according to the text sample and the labels corresponding to the text sample.
To facilitate training of the text classification model, the labels may be represented by label vectors, whose components in order represent the labeling probability values of the text sample for the preset categories, i.e., the probabilities that the category of the text sample is each preset category. Each labeling probability value is a value between 0 and 1 inclusive, and the labeling probability values of a text sample over all preset categories sum to 1. For example, suppose the preset categories are ordered in the label vector as: normal category, advertisement category, abuse category, pornography category. The label (0.3, 0.7, 0, 0) then indicates that the labeling probability value of the text sample for the normal category is 0.3 (the probability of the text sample being the normal category is 0.3) and for the advertisement category is 0.7 (the probability of the text sample being the advertisement category is 0.7).
The label may be derived from a plurality of labeling results. For example, if m persons each label the category of the text sample, where m is an integer greater than 1, m labeling results y_1, y_2, …, y_m are obtained for the text sample, and each labeling result indicates that the category of the text sample is one of the preset categories. For each text sample, the number of labeling results corresponding to each preset category among the m labeling results, together with the total number m of labeling results, determines the label of the text sample, that is, the labeling probability values of the text sample for all the preset categories.
In a possible implementation manner, the ratio of the number of the labeling results corresponding to the preset category to the total number of the labeling results of the text sample is determined as a labeling probability value corresponding to the text sample and the preset category.
For example, suppose the preset categories are a normal category, an advertisement category, and an abuse category. If 5 people each label one text sample, 5 labeling results are obtained; if 1 labeling result is the normal category and 4 are the advertisement category, then the labeling probability value of the text sample for the normal category is 0.2 and for the advertisement category is 0.8.
In another possible implementation manner, the ratio is obtained through the above manner, and the product of the weight corresponding to the ratio and the ratio is determined as the label probability value corresponding to the text sample and the preset category.
After the text classification model is trained, the text classification model can be used for performing text classification on the target text, and the target text is input into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset classes.
S202, according to the prediction frequency values respectively corresponding to the target text and all the preset categories, determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text.
The preset category corresponding to the largest of the prediction frequency values for the target text over all preset categories is the target category of the target text. For example, suppose the outputs of the text classification model represent, in order, the prediction frequency values for the preset categories: normal, advertisement, abuse, pornography. If the target text is input into the text classification model and the outputs are 0.2, 0.8, 0, and 0 in sequence, then the prediction frequency value of the target text for the normal category is 0.2, for the advertisement category 0.8, and for the abuse and pornography categories 0. The largest prediction frequency value is 0.8, its corresponding preset category is the advertisement category, so the category of the target text is the advertisement category.
In this embodiment, the target text is input into the text classification model to obtain prediction frequency values corresponding to the target text and each preset category, and the preset category with the largest prediction frequency value is determined as the target category of the target text. The text classification model is trained according to labeling probability values corresponding to the text sample and each preset category, where the labeling probability value for each preset category is determined from the number of labeling results corresponding to that category and the total number of labeling results of the text sample. Because the labeling probability values take both the per-category counts and the total number of labeling results into account, the labels are more objective, the text classification model trained on them is more accurate, and text classification of the target text based on this model is therefore more accurate.
Fig. 3 is a schematic flowchart of a training method for a text classification model according to an embodiment of the present disclosure. As shown in fig. 3, the method of this embodiment is executed by any device, platform, or device cluster having computing and processing capabilities, which the present disclosure does not limit. The method of this embodiment is as follows:
S301, obtaining the labeling results of the text sample.
Each labeling result corresponds to a preset category, that is, each labeling result indicates that the category of the text sample is one of the preset categories.
The training sample set for training the text classification model may include a plurality of text samples, each corresponding to a plurality of labeling results. For example, for each text sample, m persons can each label the text sample with one of the preset categories, yielding m labeling results y_1, y_2, …, y_m, where m is an integer greater than 1.
S302, aiming at each preset category, determining a labeling probability value corresponding to the preset category of the text sample according to the number of the labeling results corresponding to the preset category and the total number of the labeling results of the text sample.
And determining the labeling probability values of the text samples corresponding to the preset categories according to the number of the labeling results corresponding to the preset categories and the total number of the labeling results of the text samples for each text sample and each preset category. For example, for each text sample, m individuals may mark the text sample in a preset category to obtain m marking results, and among the m marking results, the number of marking results corresponding to the preset category and the total number m of marking results may determine the marking probability values respectively corresponding to the text sample and all the preset categories, that is, the label of the text sample may be determined.
In a possible implementation manner, for each preset category, a ratio of the number of the labeling results corresponding to the preset category to the total number of the labeling results of the text sample is determined, and the ratio is a labeling probability value corresponding to the text sample and the preset category.
Assume that there are k preset categories, each represented by an integer from 1 to k. For each text sample there are m labeling results y_1, y_2, …, y_m, where y_j ∈ {1, 2, …, k} and j is any value from 1 to m; for example, if the first labeling result y_1 = 2, the first labeling result indicates that the text sample is in the second preset category. The labeling probability value of the text sample for each preset category can be obtained by the following formula (4):

r_i = (1/m) · Σ_{j=1}^{m} 1(y_j = i),  i = 1, 2, …, k    formula (4)

where r_i is the labeling probability value corresponding to the i-th preset category, m is the total number of labeling results, y_j is the j-th labeling result, and 1(y_j = i) equals 1 when y_j = i and 0 otherwise.
For example, suppose the preset categories are a normal category, an advertisement category, and an abuse category. If 5 people each label one text sample and the 5 labeling results are 1, 2, 2, 2, and 2, then 1 labeling result is the normal category and 4 are the advertisement category, so the labeling probability value of the text sample for the normal category is 0.2 and for the advertisement category is 0.8.
For example, in order to facilitate training of the text classification model, the label and the labeling result may be represented by a label vector, and values in the label vector sequentially represent labeling probability values of the text sample corresponding to preset categories, that is, probabilities that the categories of the text sample are the preset categories. For example, the preset categories are sequentially set in the tag vector as: normal category, advertising category, abuse category. If a certain labeling result is that the text sample is of a normal category, the labeling result can be represented by a label vector (1, 0, 0).
In another possible implementation manner, for each preset category, a ratio of the number of the marking results corresponding to the preset category to the total number of the marking results of the text sample is obtained.
The method for obtaining the ratio is already described in the above implementation, and is not described herein.
The product of the ratio and the weight corresponding to the ratio is then obtained, and the product is determined as the labeling probability value corresponding to the text sample and the preset category.
To increase the confidence of the maximum prediction frequency value, the ratio r can be further transformed to obtain the labeling probability value of the text sample for the preset category, i.e., a softened label vector. For example, a weight may be set for each ratio, and the product of the ratio and its weight is the labeling probability value of the text sample for the preset category. For instance, the weight for the largest ratio is set to a first weight and the weights for the other ratios to a second weight, where the first weight is a value greater than 1 and the second weight is a value smaller than 1.
Optionally, the labeling probability value of the text sample for the preset category may be obtained by the following formula (5):

p_i = r_i^α / (r_1^α + r_2^α + … + r_k^α)    formula (5)

where p_i is the labeling probability value corresponding to the i-th preset category, r_i is the ratio obtained by formula (4), and α is a preset parameter with α ≥ 1; the larger α is, the more p concentrates on the mode of the m labeling results.
the confidence level of the preset category of the high ticket can be increased, namely, the maximum ratio is increased, and other ratios except the maximum ratio are reduced. For example, if pornography, abuse, advertising, normal quartering task is performed, the "teacher face" target text is of two of the three labels, labeled abuse category, and one is labeled normal category, then the voting vector r = (0, 0.67, 0, 0.33) for the target text, assuming α =3, and the resulting label vector is p = (0, 0.89, 0, 0.11).
Optionally, if α = 1, then p = r: the obtained ratios are not processed and are used directly as the labeling probability values of the text sample for the preset categories. If α → ∞, exactly one component of p is 1 and the rest are 0, and the labeling probability value vector of the text sample is the one-hot vector obtained by taking the mode of the labeling results.
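As a sketch of how formulas (4) and (5) turn m labeling results into a softened label vector, the following hypothetical helper (NumPy-based; the function name is an illustrative assumption) reproduces the worked example above:

```python
import numpy as np

def soft_label(votes, k, alpha=3.0):
    """votes: the m labeling results y_1..y_m, each an integer in 1..k.
    Returns the label vector p of formula (5); alpha = 1 reproduces the raw
    ratios r of formula (4), and large alpha approaches the one-hot mode."""
    votes = np.asarray(votes)
    r = np.array([(votes == i).mean() for i in range(1, k + 1)])  # formula (4)
    p = r ** alpha
    return p / p.sum()                                            # formula (5)

# Worked example above, categories ordered (pornography, abuse, advertisement,
# normal): two of three annotators chose abuse, one chose normal.
print(soft_label([2, 2, 4], k=4, alpha=3))  # -> approximately [0, 0.89, 0, 0.11]
```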
S303, training the text classification model according to the labeling probability values respectively corresponding to the text sample and all the preset categories.
During training, the classification loss of each text sample can be calculated; the parameters of the text classification model are then adjusted according to these classification losses, and execution returns to the step of inputting the text samples into the text classification model until the model converges, yielding the converged text classification model.
It is understood that the steps in the embodiment shown in fig. 3 may be performed separately for training the text classification model, or may be performed before S201, so that after the text classification model is trained, the target text is classified by using the text classification model.
In this embodiment, the labeling results of the text sample are obtained, where each labeling result corresponds to a preset category; for each preset category, the labeling probability value of the text sample for the preset category is determined according to the number of labeling results corresponding to the preset category and the total number of labeling results of the text sample; and the text classification model is trained according to the labeling probability values of the text sample for all the preset categories. Because the labeling probability values take both the per-category counts and the total number of labeling results into account, the labels are more objective, the trained text classification model is more accurate, and text classification of the target text based on this model is therefore more accurate.
Fig. 4 is a schematic flowchart of another training method for a text classification model according to an embodiment of the present disclosure. Fig. 4 builds on the embodiment shown in fig. 3; as shown in fig. 4, S303 may be implemented by the following steps S3031, S3032, S3033, and S3034:
S3031, inputting the text sample into the text classification model to obtain prediction frequency values respectively corresponding to the text sample and all preset categories.
Training can be performed in multiple batches; for each batch, a plurality of text samples are input into the text classification model for training.
S3032, for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the labeling probability value corresponding to the preset category.
Optionally, the classification loss of each text sample can be obtained by the following formula (6):

Loss_s = − Σ_{i=1}^{k} p_i · log(q_i)    formula (6)

where Loss_s is the classification loss of the s-th text sample, p_i is the labeling probability value of the s-th text sample for the i-th preset category, and q_i is the prediction frequency value output by the text classification model for the i-th preset category.
S3033, adjusting parameters of the text classification model according to the classification loss of the text sample.
From the classification losses of the text samples, a total classification loss of the text classification model may be determined, and the parameters of the text classification model are adjusted according to the total classification loss.
Optionally, the total classification loss of the text classification model can be obtained by the following formula (7):

Loss = Σ_s Loss_s    formula (7)

where Loss is the total classification loss of the text classification model over the text samples of the batch, and Loss_s is the classification loss of the s-th text sample.
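A minimal sketch of formulas (6) and (7), assuming PyTorch; the epsilon guard against log(0) and the function name are illustrative additions:

```python
import torch

def classification_loss(q, p, eps=1e-12):
    """q: predicted frequency values of shape (batch, k); p: soft label
    vectors of the same shape. Cross-entropy against the soft labels."""
    per_sample = -(p * torch.log(q + eps)).sum(dim=-1)  # formula (6), per sample
    return per_sample.sum()                             # formula (7), batch total
```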
S3034, judging whether the text classification model is converged.
Wherein the total classification loss of the text classification model is calculated for each batch.
The convergence of the text classification model can be determined in several ways: by the classification loss being smaller than a first preset threshold; by the change in classification loss across multiple training iterations being smaller than a second preset threshold; or in other ways. The present disclosure does not limit the condition for determining convergence of the text classification model.
If the text classification model has converged, training stops. If the text classification model has not converged, execution returns to step S3031 until the model converges.
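One possible rendering of the S3031-S3034 loop, reusing the classification_loss helper sketched above; the batch format, optimizer handling, threshold values, and epoch cap are illustrative assumptions:

```python
def train(model, batches, optimizer, thresh1=1e-3, thresh2=1e-5, max_epochs=100):
    prev = None
    for _ in range(max_epochs):
        for input_ids, attention_mask, p in batches:  # p: soft label vectors
            q = model(input_ids, attention_mask)      # S3031: prediction values
            loss = classification_loss(q, p)          # S3032: classification loss
            optimizer.zero_grad()
            loss.backward()                           # S3033: adjust parameters
            optimizer.step()
            # S3034: converged if the loss is below a first threshold, or if it
            # changed by less than a second threshold since the last iteration.
            if loss.item() < thresh1:
                return model
            if prev is not None and abs(prev - loss.item()) < thresh2:
                return model
            prev = loss.item()
    return model
```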
If the classification loss were computed with only a single one-hot label, it would be determined solely by the preset category whose component is largest in the one-hot label. In this embodiment, the classification loss of each text sample is determined from the prediction frequency value and the labeling probability value of every preset category, taking multiple preset categories into account, which improves the learning and generalization ability of the text classification model and makes the model more accurate.
In other scenarios, certain words appear in texts of different preset categories during classification; for example, a word may appear frequently in texts of one preset category but also, at times, in texts of other preset categories. In a text review scenario, texts belonging to the violation category typically contain certain ambiguous words that appear frequently in violation samples, e.g., "panning" in the advertisement category or "swine" in the abuse category, yet texts containing these words are not always violating content. During training, if a text sample contains such multi-category words, the text classification model tends only to check whether the multi-category words are present, without grasping their meaning in the context sentence; the model thus learns wrong information and its accuracy suffers. In text review, whether a word carries a "violating" meaning is often not learned by the model. To address this technical problem, the influence of multi-category words in the text sample must be considered while training the text classification model: a similarity loss is determined according to the similarity between the word vectors of the multi-category words contained in the text sample and the reference word vectors of those words, where the reference word vector of a multi-category word is its word vector in a reference text. The similarity loss takes into account the contextual semantics of the multi-category words, and adding it to the training lets the text classification model learn the different uses of multi-category words, so that the model classifies text more accurately.
Fig. 5 is a schematic flowchart of another training method for a text classification model provided in an embodiment of the present disclosure. Fig. 5 builds on the embodiment shown in fig. 4; as shown in fig. 5, after S3031 the following steps S3035, S3036, and S3037 may further be included, and accordingly S3033 may be replaced by S30331:
S3035, judging whether the text sample contains multi-category words.
For each text sample, it is judged whether the text sample contains multi-category words. A multi-category word set containing a plurality of multi-category words may be preset, and whether the text sample contains any multi-category word in the set is determined by comparing the set against the text sample.
If the text sample does not contain multi-category words, the similarity loss is 0, that is, the similarity loss need not be calculated, and the text classification model is trained according to the existing process. If the text sample contains multi-category words, the process continues with S3036.
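A minimal sketch of this check, assuming a hypothetical preset word set (an actual set would be built for the target domain):

MULTI_CATEGORY_WORDS = {"panning", "swine"}  # hypothetical preset multi-category word set

def find_multi_category_words(tokens: list[str]) -> list[str]:
    # Return the multi-category words contained in the text sample; an empty
    # list means the similarity loss is 0 and need not be calculated.
    return [tok for tok in tokens if tok in MULTI_CATEGORY_WORDS]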
S3036, for each multi-category word, determining the similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word.
The reference word vector of the multi-category word is a word vector of the multi-category word in the reference text.
Optionally, the word vector of a multi-category word may be obtained through the text classification model itself. As shown in fig. 1, the BERT component of the text classification model 102 outputs a word vector for each word in the text, so if the text contains multi-category words, their word vectors can be taken directly from the BERT output.
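For illustration only, the sketch below reads per-token word vectors from a Hugging Face BERT encoder; the present disclosure only requires that BERT output a word vector per word, so the specific checkpoint name here is an assumption.

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = BertModel.from_pretrained("bert-base-chinese")

def word_vectors(text: str):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**enc)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    # last_hidden_state has shape (1, seq_len, hidden_size): one vector per
    # token, so the vector of a multi-category word can be read off directly.
    return tokens, out.last_hidden_state[0]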
Further, the process of determining the similarity loss may include:
First, it is judged whether the target category of the text sample is the same as the reference category of the reference text.
And if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word.
Optionally, if the target category of the text sample is the same as the reference category of the reference text, the similarity loss corresponding to the multi-category word can be obtained by the following formula (8):

$L_{w} = \mathrm{sim}\left(v_{w}, \bar{v}_{w}\right)$    formula (8)

where $w$ is the multi-category word, $L_{w}$ is the similarity loss corresponding to $w$, $v_{w}$ is the word vector of $w$ in the text sample, $\bar{v}_{w}$ is the reference word vector corresponding to $w$, and $\mathrm{sim}(\cdot,\cdot)$ denotes the similarity between the two word vectors.
And if the target type of the text sample is different from the reference type of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-classification word.
Optionally, if the target category of the text sample is different from the reference category of the reference text, the similarity loss corresponding to the multi-category word may be obtained by the following formula (9):
$L_{w} = 1 - \mathrm{sim}\left(v_{w}, \bar{v}_{w}\right)$    formula (9)

where $w$ is the multi-category word, $L_{w}$ is the similarity loss corresponding to $w$, $v_{w}$ is the word vector of $w$ in the text sample, and $\bar{v}_{w}$ is the reference word vector corresponding to $w$.
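Read together, formulas (8) and (9) can be sketched as follows; cosine similarity is an assumption here, since the disclosure does not fix the similarity measure:

import torch
import torch.nn.functional as F

def word_similarity_loss(word_vec: torch.Tensor, ref_vec: torch.Tensor,
                         same_category: bool) -> torch.Tensor:
    # Similarity between the word vector in the text sample and the
    # reference word vector (cosine similarity assumed).
    sim = F.cosine_similarity(word_vec, ref_vec, dim=-1)
    # Formula (8): same target and reference category -> loss is the similarity.
    # Formula (9): different categories -> loss is 1 minus the similarity.
    return sim if same_category else 1.0 - sim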
S3037, determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-classification words.
After the similarity loss corresponding to each multi-category word is obtained, the similarity loss of the text sample can be determined from these per-word losses. Specifically, the similarity losses of all the multi-category words contained in the text sample are added together to obtain the similarity loss of the text sample. Illustratively, for the $s$-th text sample, the similarity loss $L_{w}$ corresponding to each multi-category word in the sample is obtained through formula (8) or formula (9), and these losses are added to obtain the similarity loss $L_{\mathrm{sim}}^{(s)}$ of the $s$-th text sample.
Optionally, the similarity loss corresponding to the text classification model is determined according to the similarity loss of the text sample. For example, the similarity loss corresponding to the text classification model may be a result of adding similarity losses of all text samples, and the similarity loss corresponding to the text classification model may be obtained by the following formula (10):
$L_{\mathrm{sim}} = \sum_{s} L_{\mathrm{sim}}^{(s)}$    formula (10)

where $L_{\mathrm{sim}}$ is the similarity loss of the text classification model and $L_{\mathrm{sim}}^{(s)}$ is the similarity loss of the $s$-th text sample. Illustratively, $L_{\mathrm{sim}}^{(s)}$ is obtained by adding the per-word similarity losses $L_{w}$ computed by formula (8) or formula (9).
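A short sketch of this aggregation, under the reconstruction of formula (10) above:

import torch

def sample_similarity_loss(word_losses: list) -> torch.Tensor:
    # Sum the losses of all multi-category words in one text sample
    # (a list of scalar tensors); a sample without multi-category words
    # contributes 0.
    if not word_losses:
        return torch.tensor(0.0)
    return torch.stack(word_losses).sum()

def model_similarity_loss(sample_losses: list) -> torch.Tensor:
    # Formula (10): sum the similarity losses of all text samples
    # (assumes at least one sample in the batch).
    return torch.stack(sample_losses).sum()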
And S30331, adjusting parameters of the text classification model according to the classification loss of the text samples and the similarity loss of the text samples.
The parameters of the text classification model are adjusted jointly according to the classification loss of the text samples and the similarity loss of the text samples.
Optionally, the total loss of the text classification model may be obtained according to the classification loss of the text sample and the similarity loss of the text sample, so that the parameter of the text classification model is adjusted according to the total loss. Wherein, the total loss of the text classification model can be obtained by the following formula (11):
$L = L_{\mathrm{cls}} + L_{\mathrm{sim}}$    formula (11)

where $L$ is the total loss of the text classification model, $L_{\mathrm{cls}}$ is the overall classification loss, and $L_{\mathrm{sim}}$ is the overall similarity loss; $L_{\mathrm{cls}}$ can be obtained from formula (7), and $L_{\mathrm{sim}}$ can be obtained from formula (10).
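One way S30331 could look in code, assuming formula (11) is the plain sum reconstructed above and a hypothetical optimizer:

import torch

def training_step(optimizer: torch.optim.Optimizer,
                  cls_loss: torch.Tensor,
                  sim_loss: torch.Tensor) -> float:
    total = cls_loss + sim_loss  # formula (11): L = L_cls + L_sim
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()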
It should be understood that S3035, S3036, and S3037 need not be executed in any particular order relative to S3032: S3032 may be executed first and then S3035 to S3037, S3035 to S3037 may be executed first and then S3032, or the two may be executed simultaneously.
In this embodiment, it is determined that a text sample contains multi-category words; for each multi-category word, the similarity loss corresponding to that word is determined according to the similarity between its word vector in the text sample and its reference word vector, where the reference word vector is the word vector of the multi-category word in the reference text; and the similarity loss of the text sample is determined from the similarity losses corresponding to all the multi-category words. Because the similarity loss takes into account the semantic information of the context of multi-category words, adding its calculation during training and adjusting the model parameters based on it makes the text classification model learn the differences between uses of multi-category words, so the model classifies text more accurately.
Fig. 6 is a schematic flowchart of another text classification method provided in an embodiment of the present disclosure. As shown in fig. 6, the method of this embodiment is executed by any device, apparatus, platform, or device cluster having computing and processing capabilities; the present disclosure is not limited in this respect. The method of this embodiment is as follows:
S601, inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories.
The text classification model is obtained by adjusting according to similarity loss, and the similarity loss is determined according to the similarity of word vectors of multi-classification words and reference word vectors of the multi-classification words contained in a text sample; the reference word vector of the multi-category word is the word vector of the multi-category word in the reference text.
The text classification model is trained in advance, before it is used to classify the target text. During training, the calculation of the similarity loss is added, and the text classification model is adjusted according to the similarity loss. For each text sample, it is detected whether the sample contains multi-category words: a multi-category word set containing a plurality of multi-category words may be preset, and whether the text sample contains any word in the set is determined by comparing the set against the sample. If the text sample contains no multi-category words, the similarity loss is 0, that is, it need not be calculated, and the model is trained according to the existing process. If the text sample contains multi-category words, the similarity loss is determined according to the similarity between the word vector of each multi-category word and the reference word vector of that word. The implementation principle of adjusting the model according to the similarity loss during training is similar to that of the embodiment shown in fig. 5 and is not repeated here.
S602, according to the prediction frequency values respectively corresponding to the target text and all the preset categories, determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text.
The implementation principle of S602 is similar to that of S202, and is not described here again.
In this embodiment, the target text is input into the text classification model to obtain prediction frequency values corresponding to the target text and all preset categories, and the preset category with the largest prediction frequency value is determined as the target category of the target text. The text classification model is trained according to the similarity loss, which is obtained from the word vectors of the multi-category words contained in the text samples and the reference word vectors of those words. Because the similarity loss takes into account the semantic information of the context of multi-category words, adding its calculation during training makes the text classification model learn the differences between uses of multi-category words, so the model classifies text more accurately.
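A minimal sketch of S601 and S602, assuming `model` maps a tokenized text to one prediction frequency value per preset category (the exact model interface is not fixed by the disclosure):

import torch

@torch.no_grad()
def classify(model, tokenizer, text: str, categories: list) -> str:
    enc = tokenizer(text, return_tensors="pt")
    scores = model(**enc)  # (1, num_categories) prediction frequency values
    # S602: the preset category with the largest prediction frequency value
    # is the target category of the target text.
    return categories[int(scores.argmax(dim=-1))]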
Fig. 7 is a schematic flowchart of a training method for a text classification model according to another embodiment of the present disclosure, as shown in fig. 7, the method of this embodiment is executed by any device, equipment, platform, or equipment cluster with computing and processing capabilities, and the present disclosure is not limited thereto, and the method of this embodiment is as follows:
S701, inputting the text sample into the text classification model.
S702, judging whether the text sample contains multi-classification words.
And if the text sample does not contain the multi-classification words, training a text classification model according to the existing process.
If the text sample contains multiple classified words, the process continues to step S703.
The implementation principle and implementation manner of steps S702 and S3035 are similar, and are not described herein again.
S703, for each multi-category word, determining the similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word.
The reference word vector of the multi-category word is a word vector of the multi-category word in the reference text.
The implementation principle and implementation manner of steps S703 and S3036 are similar, and are not described herein again.
And S704, determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-classification words.
The implementation principle and implementation manner of steps S704 and S3037 are similar, and are not described herein again.
S705, adjusting parameters of the text classification model according to the similarity loss of the text samples.
And S706, judging whether the text classification model is converged.
The convergence of the text classification model may be determined when the similarity loss is smaller than a third preset threshold, or when the change between similarity losses obtained in successive training iterations is smaller than a fourth preset threshold. The present disclosure does not limit the condition for determining convergence of the text classification model.
If the text classification model has converged, training stops. If it has not converged, the process returns to S701 until the text classification model converges.
It is to be understood that the steps in the embodiment shown in fig. 7 may be performed separately for training the text classification model, or may be performed before S601, so that after the text classification model is trained, the target text is classified by using the text classification model.
In this embodiment, it is determined that a text sample contains multi-category words. For each multi-category word, the similarity loss corresponding to that word is determined according to the similarity between its word vector in the text sample and its reference word vector, where the reference word vector is the word vector of the multi-category word in the reference text. The similarity loss of the text sample is then determined from the similarity losses corresponding to all the multi-category words, and the parameters of the text classification model are adjusted according to the similarity loss of the text samples. Because the similarity loss takes into account the semantic information of the context of multi-category words, adding its calculation during training makes the text classification model learn the differences between uses of multi-category words, so the model classifies text more accurately.
Fig. 8 is a schematic flowchart of another training method for a text classification model provided in an embodiment of the present disclosure. Fig. 8 builds on the embodiment shown in fig. 7; further, as shown in fig. 8, S701 may include S7011, and S703 may include S7031, S7032, S7033, and S7034:
S7011, inputting the text sample into the text classification model to obtain the target category of the text sample.
S7031, judging whether the target category of the text sample is the same as the reference category of the reference text.
If the target category of the text sample is the same as the reference category of the reference text, S7032 is executed. If the target category of the text sample is different from the reference category of the reference text, S7033 is executed.
S7032, determining similarity of word vectors of the multi-classification words in the text sample and reference word vectors of the multi-classification words, and determining similarity loss corresponding to the multi-classification words.
S7033, obtaining similarity of word vectors of the multi-classification words in the text sample and reference word vectors of the multi-classification words.
And S7034, determining a difference value obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-category words.
The implementation principle and manner of the method of this embodiment are similar to those of S3036 and are not described here again.
Fig. 9 is a schematic structural diagram of a text classification apparatus provided in an embodiment of the present disclosure, and as shown in fig. 9, the apparatus of the embodiment includes:
an obtaining module 91, configured to input the target text into the text classification model, and obtain prediction frequency values corresponding to the target text and all preset categories, respectively; the text classification model is obtained by training according to labeling probability values respectively corresponding to the text samples and all preset categories, and the labeling probability value corresponding to each preset category of the text samples is determined according to the number of labeling results corresponding to the preset categories and the total number of the labeling results of the text samples;
the determining module 92 is configured to determine, according to the prediction frequency values respectively corresponding to the target text and all the preset categories, the preset category with the largest prediction frequency value as the target category corresponding to the target text.
Optionally, the apparatus further comprises:
the acquisition module is used for acquiring marking results of the text samples, wherein each marking result corresponds to a preset category;
the determination module 92 is further configured to: for each preset category, determining a labeling probability value corresponding to the text sample and the preset category according to the number of labeling results corresponding to the preset category and the total number of the labeling results of the text sample;
and the training module is used for training the text classification model according to the labeling probability values respectively corresponding to the text samples and all the preset categories.
Optionally, the determining module 92 is specifically configured to:
and determining the ratio of the number of the marking results corresponding to the preset category to the total number of the marking results of the text sample, wherein the ratio is the marking probability value corresponding to the text sample and the preset category.
Optionally, the determining module 92 is specifically configured to:
acquiring the ratio of the number of the marking results corresponding to the preset category to the total number of the marking results of the text sample;
and obtaining the product of the weight value corresponding to the ratio and the ratio, and determining the product as the labeling probability value corresponding to the text sample and the preset category.
Optionally, the training module is specifically configured to:
inputting the text sample into a text classification model to obtain prediction frequency values respectively corresponding to the text sample and all preset classes;
for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the label probability value corresponding to the preset category;
and adjusting parameters of the text classification model according to the classification loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model is converged.
Optionally, the determining module 92 is further configured to:
determining that the text sample contains multi-category words;
for each multi-classification word, determining similarity loss corresponding to the multi-classification word according to similarity of a word vector of the multi-classification word in a text sample and a reference word vector of the multi-classification word, wherein the reference word vector of the multi-classification word is the word vector of the multi-classification word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-classification words;
correspondingly, the training module is specifically configured to:
and adjusting parameters of the text classification model according to the classification loss of the text samples and the similarity loss of the text samples.
Optionally, the training module is specifically configured to:
inputting the text sample into a text classification model to obtain a target category of the text sample;
the determining module 92 is specifically configured to:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity of the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
and if the target type of the text sample is different from the reference type of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-classification word.
Fig. 10 is a schematic structural diagram of another text classification apparatus provided in an embodiment of the present disclosure; as shown in fig. 10, the apparatus of this embodiment includes:
the obtaining module 11 is configured to input the target text into the text classification model, and obtain prediction frequency values corresponding to the target text and all preset categories respectively; the text classification model is obtained by training according to similarity loss, and the similarity loss is determined according to the similarity of word vectors of multi-classification words and reference word vectors of the multi-classification words contained in a text sample; the reference word vector of the multi-classification word is a word vector of the multi-classification word in the reference text;
the determining module 12 is configured to determine, according to the prediction frequency values respectively corresponding to the target text and all the preset categories, the preset category with the largest prediction frequency value as the target category corresponding to the target text.
Optionally, the apparatus further comprises:
the input module is used for inputting the text sample into the text classification model;
the determination module 12 is further configured to: determining that the text sample contains multi-category words; for each multi-classification word, determining similarity loss corresponding to the multi-classification word according to similarity of a word vector of the multi-classification word in a text sample and a reference word vector of the multi-classification word, wherein the reference word vector of the multi-classification word is the word vector of the multi-classification word in a reference text; determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-classification words;
and the adjusting module is used for adjusting the parameters of the text classification model according to the similarity loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model is converged.
Optionally, the apparatus further comprises:
the obtaining module is used for inputting the text sample into the text classification model to obtain the target category of the text sample;
the determining module 12 is specifically configured to:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity of the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
and if the target type of the text sample is different from the reference type of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-classification word.
The apparatus of the foregoing embodiment may be configured to implement the technical solution of the foregoing method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 11 is a schematic structural diagram of a text classification device provided in an embodiment of the present disclosure, and as shown in fig. 11, the text classification device in this embodiment includes:
a memory 111 for storing processor-executable instructions;
a processor 112, configured to implement the method described in any one of fig. 2 to 5 above when executing the executable instructions.
Fig. 12 is a schematic structural diagram of another text classification device provided in the embodiment of the present disclosure, and as shown in fig. 12, the device of the embodiment includes:
a memory 121 for storing processor-executable instructions;
processor 122, when executing executable instructions, is configured to implement the method as described in any of fig. 6-8 above.
The apparatus of the foregoing embodiment may be configured to implement the technical solution of the foregoing method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again.
The disclosed embodiments provide a computer-readable storage medium having stored therein computer-executable instructions for implementing a method of text classification as described in any of fig. 2-5 above when executed by a processor.
The disclosed embodiments provide a computer-readable storage medium having stored therein computer-executable instructions for implementing a method for text classification as described in any one of fig. 6-8 above when executed by a processor.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A method of text classification, comprising:
inputting a target text into a text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories; the text classification model is obtained by training according to labeling probability values respectively corresponding to a text sample and all preset categories, and the labeling probability value corresponding to each preset category of the text sample is determined according to the number of labeling results corresponding to the preset categories and the total number of the labeling results of the text sample;
and determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
2. The method of claim 1, wherein before inputting the target text into the text classification model and obtaining the predicted frequency values corresponding to the target text and all the preset categories, the method further comprises:
obtaining marking results of the text samples, wherein each marking result corresponds to a preset category;
for each preset category, determining a labeling probability value of the text sample corresponding to the preset category according to the number of labeling results corresponding to the preset category and the total number of the labeling results of the text sample;
and training the text classification model according to the labeling probability values respectively corresponding to the text samples and all the preset classes.
3. The method of claim 2, wherein the determining the labeling probability value of the text sample corresponding to the preset category according to the number of the labeling results corresponding to the preset category and the total number of the labeling results of the text sample comprises:
and determining the ratio of the number of the marking results corresponding to the preset category to the total number of the marking results of the text sample as the marking probability value corresponding to the text sample and the preset category.
4. The method of claim 2, wherein the determining the labeling probability value of the text sample corresponding to the preset category according to the number of the labeling results corresponding to the preset category and the total number of the labeling results of the text sample comprises:
acquiring the number of the marking results corresponding to the preset category and the ratio of the total number of the marking results of the text sample;
and obtaining the product of the weight value corresponding to the ratio and the ratio, and determining the product as the labeling probability value corresponding to the text sample and the preset category.
5. The method according to any one of claims 1 to 4, wherein the training the text classification model according to the label probability values of the text samples corresponding to all the preset categories respectively comprises:
inputting the text sample into the text classification model to obtain prediction frequency values respectively corresponding to the text sample and all the preset categories;
for each preset category, determining the classification loss of the text sample according to the prediction frequency value corresponding to the preset category and the label probability value corresponding to the preset category;
and adjusting parameters of the text classification model according to the classification loss of the text sample, and returning to execute the step of inputting the text sample into the text classification model until the text classification model converges.
6. The method of claim 5, wherein after inputting the text sample into the text classification model and obtaining the predicted frequency values corresponding to the text sample and all the predetermined categories, the method further comprises:
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
correspondingly, the adjusting the parameters of the text classification model according to the classification loss of the text sample includes:
and adjusting parameters of the text classification model according to the classification loss of the text sample and the similarity loss of the text sample.
7. The method of claim 6, wherein the entering the text sample into the text classification model comprises:
inputting the text sample into the text classification model to obtain a target category of the text sample;
determining a similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, including:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
and if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-classification word.
8. A method of text classification, comprising:
inputting a target text into a text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset categories; the text classification model is obtained by training according to similarity loss, and the similarity loss is determined according to the similarity of word vectors of multi-classification words contained in a text sample and reference word vectors of the multi-classification words; the reference word vector of the multi-classification word is a word vector of the multi-classification word in a reference text;
and determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
9. The method of claim 8, wherein before inputting the target text into the text classification model and obtaining the predicted frequency values corresponding to the target text and all the predetermined categories, the method further comprises:
inputting a text sample into the text classification model;
determining that the text sample contains multi-category words;
for each multi-category word, determining similarity loss corresponding to the multi-category word according to similarity between a word vector of the multi-category word in the text sample and a reference word vector of the multi-category word, wherein the reference word vector of the multi-category word is the word vector of the multi-category word in a reference text;
determining the similarity loss of the text sample according to the similarity loss corresponding to all the multi-category words;
and adjusting parameters of the text classification model according to the similarity loss of the text samples, and returning to execute the step of inputting the text samples into the text classification model until the text classification model converges.
10. The method of claim 9, wherein prior to determining that the text sample contains the multi-category word, further comprising:
inputting a text sample into the text classification model to obtain a target category of the text sample;
determining a similarity loss corresponding to the multi-category word according to the similarity between the word vector of the multi-category word in the text sample and the reference word vector of the multi-category word, including:
if the target category of the text sample is the same as the reference category of the reference text, determining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the similarity loss corresponding to the multi-classification word;
if the target category of the text sample is different from the reference category of the reference text, obtaining the similarity between the word vector of the multi-classification word in the text sample and the reference word vector of the multi-classification word, and determining the difference obtained by subtracting the similarity from 1 as the similarity loss corresponding to the multi-classification word.
11. An apparatus for text classification, comprising:
the obtaining module is used for inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset classes; the text classification model is obtained by training according to labeling probability values respectively corresponding to a text sample and all preset categories, and the labeling probability value corresponding to each preset category of the text sample is determined according to the number of labeling results corresponding to the preset categories and the total number of the labeling results of the text sample;
and the determining module is used for determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
12. An apparatus for text classification, comprising:
the obtaining module is used for inputting the target text into the text classification model to obtain prediction frequency values respectively corresponding to the target text and all preset classes; the text classification model is obtained by training according to similarity loss, and the similarity loss is determined according to the similarity of word vectors of multi-classification words contained in a text sample and reference word vectors of the multi-classification words; the reference word vector of the multi-classification word is a word vector of the multi-classification word in a reference text;
and the determining module is used for determining the preset category with the maximum prediction frequency value as the target category corresponding to the target text according to the prediction frequency values respectively corresponding to the target text and all the preset categories.
13. An apparatus for text classification, comprising:
a memory for storing processor-executable instructions;
a processor for implementing the method of any one of claims 1 to 7 when the executable instructions are executed.
14. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, implement a method of text classification as claimed in any one of claims 1 to 7.
CN202110392536.2A 2021-04-13 2021-04-13 Text classification method, device, equipment and computer readable storage medium Active CN112989051B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110392536.2A CN112989051B (en) 2021-04-13 2021-04-13 Text classification method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110392536.2A CN112989051B (en) 2021-04-13 2021-04-13 Text classification method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112989051A true CN112989051A (en) 2021-06-18
CN112989051B CN112989051B (en) 2021-09-10

Family

ID=76338108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110392536.2A Active CN112989051B (en) 2021-04-13 2021-04-13 Text classification method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112989051B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008342A (en) * 2019-04-12 2019-07-12 智慧芽信息科技(苏州)有限公司 Document classification method, apparatus, equipment and storage medium
CN110046356A (en) * 2019-04-26 2019-07-23 中森云链(成都)科技有限责任公司 Label is embedded in the application study in the classification of microblogging text mood multi-tag
CN110705274A (en) * 2019-09-06 2020-01-17 电子科技大学 Fusion type word meaning embedding method based on real-time learning
CN111563167A (en) * 2020-07-15 2020-08-21 智者四海(北京)技术有限公司 Text classification system and method
CN112270379A (en) * 2020-11-13 2021-01-26 北京百度网讯科技有限公司 Training method of classification model, sample classification method, device and equipment
US20210027018A1 (en) * 2019-07-22 2021-01-28 Advanced New Technologies Co., Ltd. Generating recommendation information

Also Published As

Publication number Publication date
CN112989051B (en) 2021-09-10

Similar Documents

Publication Publication Date Title
WO2020140372A1 (en) Recognition model-based intention recognition method, recognition device, and medium
CN109710744B (en) Data matching method, device, equipment and storage medium
CN110727779A (en) Question-answering method and system based on multi-model fusion
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN107180084B (en) Word bank updating method and device
CN110222178A (en) Text sentiment classification method, device, electronic equipment and readable storage medium storing program for executing
US8788503B1 (en) Content identification
CN109284374B (en) Method, apparatus, device and computer readable storage medium for determining entity class
CN106506327B (en) Junk mail identification method and device
CN112036168B (en) Event main body recognition model optimization method, device, equipment and readable storage medium
CN113254643B (en) Text classification method and device, electronic equipment and text classification program
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN111522908A (en) Multi-label text classification method based on BiGRU and attention mechanism
CN112667782A (en) Text classification method, device, equipment and storage medium
CN110909144A (en) Question-answer dialogue method and device, electronic equipment and computer readable storage medium
CN112417132A (en) New intention recognition method for screening negative samples by utilizing predicate guest information
CN114925702A (en) Text similarity recognition method and device, electronic equipment and storage medium
CN112307210A (en) Document tag prediction method, system, medium and electronic device
CN112989051B (en) Text classification method, device, equipment and computer readable storage medium
CN111681731A (en) Method for automatically marking colors of inspection report
CN115687917A (en) Sample processing method and device, and recognition model training method and device
CN115309899A (en) Method and system for identifying and storing specific content in text
Chen et al. Learning the chinese sentence representation with LSTM autoencoder
CN114780678A (en) Text retrieval method, device, equipment and storage medium
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant